OmniScrape

Web Scraping with PHP

PHP powers a huge slice of the web, and a surprising amount of scraping still runs as Artisan commands or plain cron scripts on shared hosting. Guzzle is the HTTP client everyone knows; Symfony DomCrawler turns HTML into traversable nodes without pulling in a full framework.

The ceiling arrives fast: max_execution_time, datacenter IPs blocked on the first real retailer, and TEXT columns that truncate your HTML dumps at 64 KB. This guide walks from a Guzzle fetch through DomCrawler extraction, then switches the fetch layer to the OmniScrape API when Cloudflare shows up. Compare with web scraping with Python if your team is split across languages.

On this page

  1. Composer setup
  2. Fetch with Guzzle
  3. Parse with DomCrawler
  4. Pagination in a cron-friendly loop
  5. When Guzzle returns challenge pages
  6. Guzzle + OmniScrape
  7. Laravel Artisan command pattern
  8. SPAs and js_rendering
  9. PHP-specific pitfalls
  10. Handle API responses
  11. FAQ

1. Composer setup

Install Guzzle for HTTP and symfony/dom-crawler plus symfony/css-selector for parsing. You do not need the full Symfony kernel for a scrape script.

terminal
bash
composer require guzzlehttp/guzzle
composer require symfony/dom-crawler symfony/css-selector

2. Fetch with Guzzle

Guzzle wraps cURL with a sane API. Set timeouts explicitly — PHP scripts on cron have no supervisor restarting hung processes.

fetch.php
php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client(['timeout' => 30]);
$response = $client->get('https://books.toscrape.com/catalogue/page-1.html');
$html = (string) $response->getBody();

file_put_contents('page.html', $html);
echo 'Saved ' . strlen($html) . " bytes\n";

3. Parse with DomCrawler

DomCrawler filters nodes with CSS selectors via symfony/css-selector. filter() returns a new crawler scoped to matches; each() walks them.

parse.php
php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$books = [];

$crawler->filter('article.product_pod')->each(function (Crawler $card) use (&$books) {
    $books[] = [
        'title' => $card->filter('h3 a')->attr('title'),
        'price' => $card->filter('.price_color')->text(''),
        'in_stock' => str_contains($card->filter('.instock')->text(''), 'In stock'),
    ];
});

echo count($books) . " books found\n";
print_r(array_slice($books, 0, 2));
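
The price field extracted above is still a display string such as "£51.77". A minimal sketch of normalizing it to integer minor units before storage follows. The helper name parsePriceToMinorUnits is hypothetical, and the code assumes a dot as decimal separator and a comma as thousands separator, which matches the demo site but not every locale.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert a display price such as "£51.77" into
 * integer minor units (pence, cents). Returns null when no number is
 * found, so callers can reject the row instead of storing a silent zero.
 * Assumes "." is the decimal separator and "," a thousands separator.
 */
function parsePriceToMinorUnits(string $raw): ?int
{
    // Drop thousands separators, then take the first number with up to two decimals.
    $clean = str_replace(',', '', $raw);
    if (!preg_match('/(\d+)(?:\.(\d{1,2}))?/', $clean, $m)) {
        return null;
    }

    $whole = (int) $m[1];
    // Right-pad the fraction so "51.7" counts as 70 minor units, not 7.
    $fraction = isset($m[2]) ? (int) str_pad($m[2], 2, '0') : 0;

    return $whole * 100 + $fraction;
}
```

Storing integers avoids float rounding when prices are compared across runs. A null return pairs naturally with the 'NOT FOUND' style defaults passed to text().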

4. Pagination in a cron-friendly loop

Chunk work across cron ticks if you scrape large catalogs on shared hosting — store the last page in a database row and resume next run. For open demo sites, a simple while loop suffices.

paginate.php
php
$all = [];
$page = 1;

while (true) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";
    try {
        $res = $client->get($url);
    } catch (\GuzzleHttp\Exception\ClientException $e) {
        if ($e->getResponse()->getStatusCode() === 404) break;
        throw $e;
    }

    $c = new Crawler((string) $res->getBody());
    $cards = $c->filter('article.product_pod');
    if ($cards->count() === 0) break;

    $cards->each(function (Crawler $card) use (&$all) {
        $all[] = [
            'title' => $card->filter('h3 a')->attr('title'),
            'price' => $card->filter('.price_color')->text(''),
        ];
    });

    echo "Page {$page}: " . count($all) . " total\n";
    $page++;
    sleep(2);
}
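
The resume-next-run idea for large catalogs can be sketched with a small checkpoint store. This is a minimal sketch that keeps the last finished page in a JSON file as a stand-in for the database row. The function names, the file location, and the pages-per-run budget are all assumptions, not part of any library.

```php
<?php
declare(strict_types=1);

/** Read the last completed page from the checkpoint file (0 if none yet). */
function loadLastPage(string $path): int
{
    if (!is_file($path)) {
        return 0;
    }
    $state = json_decode((string) file_get_contents($path), true);

    return is_array($state) && isset($state['last_page']) ? (int) $state['last_page'] : 0;
}

/** Persist progress; LOCK_EX keeps overlapping cron runs from interleaving writes. */
function saveLastPage(string $path, int $page): void
{
    file_put_contents($path, json_encode(['last_page' => $page]), LOCK_EX);
}

/**
 * Pages to process in this cron tick: a bounded slice starting after the
 * checkpoint, so a single run stays inside the host's time limit.
 *
 * @return int[]
 */
function pagesForThisRun(int $lastPage, int $budget): array
{
    return $budget < 1 ? [] : range($lastPage + 1, $lastPage + $budget);
}

/**
 * Process one slice. $scrapePage receives a page number and returns false
 * to signal "stop" (for example a 404 past the last catalog page).
 * Returns the number of pages completed in this tick.
 */
function runTick(string $path, int $budget, callable $scrapePage): int
{
    $done = 0;
    foreach (pagesForThisRun(loadLastPage($path), $budget) as $page) {
        if ($scrapePage($page) === false) {
            break;
        }
        saveLastPage($path, $page); // save after each page so a crash resumes cleanly
        $done++;
    }

    return $done;
}
```

Saving after every page rather than once per run costs a few extra writes, but a killed process then loses at most one page of work.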

5. When Guzzle returns challenge pages

A 200 response with "Checking your browser" in the body is worse than a 403 — your parser runs happily and saves garbage. Detect challenge markers early or route protected domains straight to OmniScrape.

See Cloudflare bypass for why header tweaks stop working on production retailers.
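
Early detection of challenge markers can be sketched as a plain string check that runs before the parser. Only the "Checking your browser" marker comes from this guide; the other two markers and both function names are illustrative assumptions, and the list is not exhaustive.

```php
<?php
declare(strict_types=1);

/**
 * Return true when the body looks like an anti-bot interstitial rather
 * than real content. Extend the marker list with strings you actually
 * observe on your target sites.
 */
function looksLikeChallenge(string $html): bool
{
    $markers = [
        'Checking your browser', // marker named in this guide
        'Just a moment...',      // assumed: common challenge-page title
        'cf-chl',                // assumed: prefix seen in challenge markup
    ];

    foreach ($markers as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }

    return false;
}

/** Decide what to do with a fetched body: parse it, or fetch again through the API. */
function nextStep(string $html): string
{
    return looksLikeChallenge($html) ? 'refetch_via_api' : 'parse';
}
```

A substring check can produce false positives on pages that legitimately mention these strings, so it works best as a gate that triggers a second fetch rather than as a reason to discard data.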

6. Guzzle + OmniScrape

POST JSON to https://api.omniscrape.io/v1/scrape. Pass the API key in X-API-Key. Feed data.content into DomCrawler — selectors unchanged.

omniscrape.php
php
$apiKey = getenv('OMNISCRAPE_KEY');

$response = $client->post('https://api.omniscrape.io/v1/scrape', [
    'headers' => ['X-API-Key' => $apiKey],
    'json' => [
        'url' => 'https://protected-shop.com/product/9912',
        'mode' => 'auto',
        'output_format' => 'html',
    ],
    'timeout' => 120,
]);

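// Cast the stream to a string explicitly; json_decode() expects a string.
$body = json_decode((string) $response->getBody(), true);

// empty() also covers a body that failed to decode or lacks the key.
if (empty($body['success'])) {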
    throw new RuntimeException('Scrape failed: ' . json_encode($body));
}

$html = $body['data']['content'];
$crawler = new Crawler($html);
$price = $crawler->filter('.product-price')->text('NOT FOUND');

echo "Price: {$price}\n";
echo 'Method: ' . $body['metadata']['method_used']
    . ', cost: $' . $body['billing']['charged'] . "\n";
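
The success check and content lookup above can be folded into one helper that fails loudly on malformed responses. This is a sketch only: unwrapContent is a hypothetical name, and the field names (success, data.content) are taken from the example in this guide rather than from a published schema.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decode an API response body and return the HTML,
 * or throw with enough context to debug the failure.
 */
function unwrapContent(string $rawJson): string
{
    try {
        $body = json_decode($rawJson, true, 512, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        throw new RuntimeException('Response was not valid JSON: ' . $e->getMessage(), 0, $e);
    }

    if (!is_array($body) || empty($body['success'])) {
        // Truncate the payload so log lines stay readable.
        throw new RuntimeException('Scrape failed: ' . substr($rawJson, 0, 200));
    }

    $content = $body['data']['content'] ?? null;
    if (!is_string($content) || $content === '') {
        throw new RuntimeException('Scrape succeeded but data.content is empty');
    }

    return $content;
}
```

With Guzzle, the call site would look like unwrapContent((string) $response->getBody()), and the returned string goes straight into new Crawler(...).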

7. Laravel Artisan command pattern

Wrap the Guzzle call in a command and schedule it in routes/console.php. Inject the HTTP client via the container; store results with Eloquent bulk inserts instead of one save() per row.

ScrapeCatalog.php
php
// app/Console/Commands/ScrapeCatalog.php
public function handle(Client $client): int
{
    $response = $client->post('https://api.omniscrape.io/v1/scrape', [
        'headers' => ['X-API-Key' => config('services.omniscrape.key')],
        'json' => [
            'url' => $this->argument('url'),
            'mode' => 'auto',
            'output_format' => 'css_extractor',
            'css_selectors' => ['title' => 'h1', 'price' => '.price'],
        ],
        'timeout' => 120,
    ]);

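    $data = json_decode((string) $response->getBody(), true)['data']['css_extracted'] ?? [];

    // Refuse to store a row with missing fields instead of upserting nulls.
    if (empty($data['title']) || empty($data['price'])) {
        $this->error('Extraction returned empty fields; skipping upsert.');
        return self::FAILURE;
    }

    // upsert() matches rows on 'sku', so $data must also carry a sku value;
    // the two selectors above do not extract one.
    Product::upsert([$data], ['sku'], ['title', 'price']);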
    return self::SUCCESS;
}

8. SPAs and js_rendering

DomCrawler sees whatever HTML arrives. SPAs that load prices via fetch() need mode:js_rendering. Read scraping JavaScript-rendered pages before burning credits on empty shells.

js_rendering payload
php
'json' => [
    'url' => 'https://spa-store.com/listing',
    'mode' => 'js_rendering',
    'output_format' => 'html',
    'js_wait_selector' => '.product-card',
    'js_wait_timeout' => 12000,
],

9. PHP-specific pitfalls

A few issues bite PHP scrapers more often than other stacks:

  • Never commit .env with OMNISCRAPE_KEY — use getenv() or Laravel config
  • Use LONGTEXT or S3 for raw HTML, not VARCHAR(255)
  • Run scrapes from the CLI, where max_execution_time defaults to 0 (no limit); do not raise the limit for web requests, which should not scrape
  • Validate non-empty extracted fields before INSERT — silent nulls poison analytics
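
The last pitfall, validating fields before INSERT, can be sketched as a small partition step between extraction and storage. The function names and the idea of a required-field list are assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Return the names of required fields that are missing, null, or blank
 * after trimming. An empty result means the row is safe to insert.
 *
 * @param array<string, mixed> $row
 * @param string[]             $required
 * @return string[]
 */
function missingFields(array $row, array $required): array
{
    $missing = [];
    foreach ($required as $field) {
        $value = $row[$field] ?? null;
        if ($value === null || (is_string($value) && trim($value) === '')) {
            $missing[] = $field;
        }
    }

    return $missing;
}

/**
 * Split scraped rows into insertable and rejected sets, so bad rows get
 * logged instead of landing in the table as silent nulls.
 *
 * @return array{0: array, 1: array} [valid rows, rejected rows]
 */
function partitionRows(array $rows, array $required): array
{
    $valid = [];
    $rejected = [];
    foreach ($rows as $row) {
        if (missingFields($row, $required) === []) {
            $valid[] = $row;
        } else {
            $rejected[] = $row;
        }
    }

    return [$valid, $rejected];
}
```

A sudden jump in the rejected count is often the first sign that a selector stopped matching after a layout change.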

10. Handle API responses

Check Guzzle status and JSON success separately:

  • 401 — bad key; fix .env, stop cron until resolved
  • 402 — out of balance; email ops, pause schedule
  • 429 — sleep and retry with backoff
  • 502 — retry up to 3 times
  • success:false — log URL, skip retry loop
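
The status handling above can be sketched as a mapping plus a bounded retry loop. This sketch assumes the Guzzle client was created with the http_errors option set to false, so 4xx and 5xx responses come back as status codes instead of exceptions. The function names, the backoff base, and the cap are assumptions; the success:false case is a JSON-level check and stays outside this loop.

```php
<?php
declare(strict_types=1);

/** Map an HTTP status to an action: 'ok', 'stop', 'retry', or 'fail'. */
function actionForStatus(int $status): string
{
    return match (true) {
        $status >= 200 && $status < 300 => 'ok',
        $status === 401, $status === 402 => 'stop',  // bad key or no balance: pause the schedule
        $status === 429, $status === 502 => 'retry', // transient: back off and try again
        default => 'fail',
    };
}

/**
 * Exponential backoff delay in seconds for a 1-based attempt number,
 * capped so a cron run cannot sleep indefinitely.
 */
function backoffSeconds(int $attempt, int $base = 2, int $cap = 60): int
{
    return min($cap, $base * (2 ** max(0, $attempt - 1)));
}

/**
 * Call $send (a callable returning an HTTP status code) up to $maxAttempts
 * times. $sleep is injectable so the loop can be tested without waiting.
 */
function sendWithRetry(callable $send, int $maxAttempts = 3, ?callable $sleep = null): string
{
    $sleep ??= 'sleep';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $action = actionForStatus($send());
        if ($action !== 'retry') {
            return $action;
        }
        if ($attempt < $maxAttempts) {
            $sleep(backoffSeconds($attempt));
        }
    }

    return 'fail'; // retries exhausted
}
```

In a real command, $send would wrap the $client->post(...) call and return $response->getStatusCode(). A 'stop' result is the cue to alert ops and skip the rest of the run.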

Frequently asked questions

Guzzle or PHP cURL extension?

Guzzle for readability and exception types. Raw cURL works in constrained hosting but is harder to maintain. OmniScrape integration looks the same either way.

DomCrawler or DiDOM?

DomCrawler if you already use Symfony components. DiDOM is lighter for standalone scripts. Both parse static HTML only.

Can I scrape from WordPress wp-cron?

Yes for small jobs, but wp-cron is unreliable on low-traffic sites. Use system cron calling php artisan or a standalone script instead.

How do I scrape logged-in pages?

Public pages behind bot walls use Web Unlocker. Pages behind your own login need Browser-as-a-Service with a scripted flow. Do not scrape data you are not authorized to access.

Why use css_extractor instead of DomCrawler?

Less PHP code and fewer places for layout changes to break silently. Keep DomCrawler when you need complex table traversal or archiving full HTML.

Related guides

  • Web Scraping with Python
  • How to Bypass Cloudflare When Web Scraping
  • Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs
  • Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

© 2026 OmniScrape. All rights reserved.