Web Scraping with PHP

PHP powers a huge slice of the web, and a surprising amount of scraping still runs as Artisan commands or plain cron scripts on shared hosting. Guzzle is the HTTP client everyone knows; Symfony DomCrawler turns HTML into traversable nodes without pulling in a full framework.

The ceiling arrives fast: max_execution_time, datacenter IPs blocked on the first real retailer, and TEXT columns that can silently truncate your HTML dumps. This guide walks from a Guzzle fetch through DomCrawler extraction, then switches the fetch layer to the OmniScrape API when Cloudflare shows up. Compare with the web scraping with Python guide if your team is split across languages.

1. Composer setup

Install Guzzle for HTTP and symfony/dom-crawler plus symfony/css-selector for parsing. You do not need the full Symfony kernel for a scrape script.

terminal

bash

composer require guzzlehttp/guzzle
composer require symfony/dom-crawler symfony/css-selector

2. Fetch with Guzzle

Guzzle wraps cURL with a sane API. Set timeouts explicitly — PHP scripts on cron have no supervisor restarting hung processes.

fetch.php

php

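<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Cap both the connection phase and the whole transfer so a hung request cannot stall the cron job.
$client = new Client(['timeout' => 30, 'connect_timeout' => 10]);
$response = $client->get('https://books.toscrape.com/catalogue/page-1.html');
$html = (string) $response->getBody();

// Keep a raw copy on disk so parsing can be rerun without refetching.
file_put_contents('page.html', $html);
echo 'Saved ' . strlen($html) . " bytes\n";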

3. Parse with DomCrawler

DomCrawler filters nodes with CSS selectors via symfony/css-selector. filter() returns a new crawler scoped to matches; each() walks them.

parse.php

php

<?php
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Parse the copy saved by fetch.php.
$html = file_get_contents('page.html');
$crawler = new Crawler($html);
$books = [];

// One array entry per product card on the listing page.
$crawler->filter('article.product_pod')->each(function (Crawler $card) use (&$books) {
    $books[] = [
        'title' => $card->filter('h3 a')->attr('title'),
        // text('') returns the default instead of throwing when the node is missing.
        'price' => $card->filter('.price_color')->text(''),
        'in_stock' => str_contains($card->filter('.instock')->text(''), 'In stock'),
    ];
});

echo count($books) . " books found\n";
print_r(array_slice($books, 0, 2));
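
The price extracted above is still a display string such as "£51.77". Before storing it, you will probably want a number. The following is a minimal sketch, assuming a dot as the decimal separator; parsePrice() is a hypothetical helper, not part of DomCrawler.

```php
<?php
// Hypothetical helper: turn a scraped price label into a float, or null when unusable.
// Assumes "." is the decimal separator and "," only groups thousands.
function parsePrice(string $raw): ?float
{
    // Drop currency symbols, whitespace, and thousands separators.
    $clean = preg_replace('/[^0-9.]/', '', $raw);

    // Reject empty results and malformed strings such as "1.2.3".
    if ($clean === null || !is_numeric($clean)) {
        return null;
    }

    return (float) $clean;
}
```

Returning null rather than 0.0 for unparseable input keeps a missing price distinguishable from a free item, which matters once the rows reach a report.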

4. Pagination in a cron-friendly loop

Chunk work across cron ticks if you scrape large catalogs on shared hosting — store the last page in a database row and resume next run. For open demo sites, a simple while loop suffices.

paginate.php

php

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ClientException;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client(['timeout' => 30]);
$all = [];
$page = 1;

while (true) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";
    try {
        $res = $client->get($url);
    } catch (ClientException $e) {
        // A 404 means we ran past the last page; any other 4xx is a real error.
        if ($e->getResponse()->getStatusCode() === 404) break;
        throw $e;
    }

    $c = new Crawler((string) $res->getBody());
    $cards = $c->filter('article.product_pod');
    // Also stop if the page loads but holds no product cards.
    if ($cards->count() === 0) break;

    $cards->each(function (Crawler $card) use (&$all) {
        $all[] = [
            'title' => $card->filter('h3 a')->attr('title'),
            'price' => $card->filter('.price_color')->text(''),
        ];
    });

    echo "Page {$page}: " . count($all) . " total\n";
    $page++;
    sleep(2); // pause between requests to stay polite
}
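
For the chunked variant described at the start of this section, each cron tick needs to know where the previous one stopped. The sketch below is one way to do that under stated assumptions: a JSON state file stands in for the database row to keep the example dependency-free, and the function names and file layout are hypothetical.

```php
<?php
// Sketch of cron-friendly chunking. The state file and function names are
// hypothetical; a JSON file replaces the database row to avoid a DB dependency.

// Return the page the next run should start from (1 when no state exists yet).
function loadNextPage(string $stateFile): int
{
    if (!is_file($stateFile)) {
        return 1;
    }
    $state = json_decode((string) file_get_contents($stateFile), true);

    // Fall back to page 1 if the file is corrupt or holds a nonsense value.
    return max(1, (int) ($state['next_page'] ?? 1));
}

// Persist the resume point; write-then-rename avoids leaving a half-written file.
function saveNextPage(string $stateFile, int $page): void
{
    $tmp = $stateFile . '.tmp';
    file_put_contents($tmp, json_encode(['next_page' => $page]));
    rename($tmp, $stateFile);
}

// Pages to fetch in this run: a fixed-size window starting at $start.
function pagesForThisRun(int $start, int $pagesPerRun): array
{
    return range($start, $start + max(1, $pagesPerRun) - 1);
}
```

A run would call loadNextPage(), loop over pagesForThisRun(), and call saveNextPage($page + 1) after each page is stored, resetting to 1 once a 404 signals the end of the catalog. Saving after every page rather than once at the end means a run killed by the host loses at most one page of work.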

5. When Guzzle returns challenge pages

A 200 response with "Checking your browser" in the body is worse than a 403 — your parser runs happily and saves garbage. Detect challenge markers early or route protected domains straight to OmniScrape.

See the Cloudflare bypass guide for why header tweaks stop working on production retailers.
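
Detecting a challenge page can be as simple as a substring check before the HTML reaches the parser. This is a rough sketch: "Checking your browser" comes from the text above, while the other markers and the function name are assumptions and certainly not a complete list.

```php
<?php
// Hypothetical marker list: "Checking your browser" comes from the article,
// the other phrases are assumed examples of interstitial pages, not a complete set.
const CHALLENGE_MARKERS = [
    'Checking your browser',
    'Just a moment...',
    'Attention Required!',
];

// True when the body looks like a bot-check page rather than real content.
function looksLikeChallenge(string $html): bool
{
    foreach (CHALLENGE_MARKERS as $marker) {
        // Case-insensitive match anywhere in the document.
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}
```

Call it right after casting the response body to a string, and either throw or reroute the URL when it returns true. A substring check can misfire on a real page that happens to contain one of these phrases, so pairing it with a check for an expected selector is a reasonable extra guard.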

6. Guzzle + OmniScrape

POST JSON to https://api.omniscrape.io/v1/scrape and pass the API key in the X-API-Key header. Feed data.content into DomCrawler; your selectors stay unchanged.

omniscrape.php

php

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();
$apiKey = getenv('OMNISCRAPE_KEY');

$response = $client->post('https://api.omniscrape.io/v1/scrape', [
    'headers' => ['X-API-Key' => $apiKey],
    'json' => [
        'url' => 'https://protected-shop.com/product/9912',
        'mode' => 'auto',
        'output_format' => 'html',
    ],
    'timeout' => 120,
]);

// Cast the PSR-7 stream to a string before decoding.
$body = json_decode((string) $response->getBody(), true);

// empty() also catches a body that is not valid JSON (json_decode returns null).
if (empty($body['success'])) {
    throw new RuntimeException('Scrape failed: ' . json_encode($body));
}

// From here on the parsing code is the same as for a direct Guzzle fetch.
$html = $body['data']['content'];
$crawler = new Crawler($html);
$price = $crawler->filter('.product-price')->text('NOT FOUND');

echo "Price: {$price}\n";
echo 'Method: ' . $body['metadata']['method_used']
    . ', cost: $' . $body['billing']['charged'] . "\n";

7. Laravel Artisan command pattern

Wrap the Guzzle call in a command and schedule it in routes/console.php. Inject the HTTP client via the container; store results with Eloquent bulk inserts instead of one save() per row.

ScrapeCatalog.php

php

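// app/Console/Commands/ScrapeCatalog.php
public function handle(Client $client): int
{
    $response = $client->post('https://api.omniscrape.io/v1/scrape', [
        'headers' => ['X-API-Key' => config('services.omniscrape.key')],
        'json' => [
            'url' => $this->argument('url'),
            'mode' => 'auto',
            'output_format' => 'css_extractor',
            // '.sku' is a placeholder selector; upsert() below needs a value for its unique key.
            'css_selectors' => ['sku' => '.sku', 'title' => 'h1', 'price' => '.price'],
        ],
        'timeout' => 120,
    ]);

    $data = json_decode((string) $response->getBody(), true)['data']['css_extracted'] ?? [];

    // Skip the write when extraction came back without the unique key.
    if (empty($data['sku'])) {
        $this->error('No SKU extracted for ' . $this->argument('url'));
        return self::FAILURE;
    }

    // upsert() matches on sku and updates title and price for existing rows.
    Product::upsert([$data], ['sku'], ['title', 'price']);
    return self::SUCCESS;
}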

8. SPAs and js_rendering

DomCrawler sees whatever HTML arrives. SPAs that load prices via fetch() need mode: js_rendering. Read the guide on scraping JavaScript-rendered pages before burning credits on empty shells.

js_rendering payload

php

'json' => [
    'url' => 'https://spa-store.com/listing',
    'mode' => 'js_rendering',
    'output_format' => 'html',
    'js_wait_selector' => '.product-card',
    'js_wait_timeout' => 12000,
],

9. PHP-specific pitfalls

A few issues bite PHP scrapers more often than other stacks:

Never commit .env with OMNISCRAPE_KEY — use getenv() or Laravel config
Use LONGTEXT or S3 for raw HTML, not VARCHAR(255)
Raise max_execution_time only for CLI runs (the CLI SAPI already defaults to 0, meaning no limit); web requests should not scrape
Validate non-empty extracted fields before INSERT — silent nulls poison analytics
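
The last point, validating before INSERT, can be sketched as a small guard that reports which required fields came back blank. The helper name and the field names in the usage note are hypothetical.

```php
<?php
// Hypothetical helper: list the required fields that are missing or blank in a scraped row.
function missingFields(array $row, array $required): array
{
    $missing = [];
    foreach ($required as $field) {
        $value = $row[$field] ?? null;

        // Treat null, non-scalars, and whitespace-only strings as missing; keep 0 and "0".
        if (!is_scalar($value) || trim((string) $value) === '') {
            $missing[] = $field;
        }
    }
    return $missing;
}
```

A caller might run missingFields($row, ['title', 'price']) and log the URL plus the returned field names instead of inserting whenever the result is non-empty. Returning the list rather than a boolean makes it easier to spot which selector broke after a layout change.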

10. Handle API responses

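Check the HTTP status and the JSON success flag separately. With Guzzle's default http_errors setting, 4xx and 5xx responses surface as exceptions, so read the status from the exception's response: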

401 — bad key; fix .env, stop cron until resolved
402 — out of balance; email ops, pause schedule
429 — sleep and retry with backoff
502 — retry up to 3 times
success:false — log URL, skip retry loop
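
The list above maps naturally onto a small decision function. This is a sketch only: the action names ('ok', 'retry', 'halt', 'skip') and both helper names are hypothetical, $attempt is assumed to count retries already made, and the backoff schedule is an assumption since the list does not specify one.

```php
<?php
// Sketch of the decision table above. Action names and helper names are hypothetical;
// the status codes and the 3-retry cap for 502 come from the list.
// $attempt is the number of retries already made for this URL (0 on the first try).
function nextAction(int $status, ?array $body, int $attempt): string
{
    switch ($status) {
        case 401:
        case 402:
            return 'halt';   // bad key or empty balance: stop the schedule, alert a human
        case 429:
            return 'retry';  // rate limited: back off, then try again
        case 502:
            return $attempt < 3 ? 'retry' : 'skip';
    }

    // Transport succeeded; now check the JSON flag separately.
    if ($status === 200 && !empty($body['success'])) {
        return 'ok';
    }

    return 'skip';           // success:false or anything unexpected: log the URL, no retry
}

// Assumed backoff schedule in seconds: 2, 4, 8, ... capped at $cap.
function backoffSeconds(int $attempt, int $cap = 60): int
{
    return min($cap, 2 ** max(1, $attempt));
}
```

The caller sleeps for backoffSeconds($attempt) before acting on 'retry' and should still cap total retries for 429 itself, since this sketch leaves that open. Keeping the decision in a pure function makes it easy to unit test without touching the network.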

Frequently asked questions

Guzzle or PHP cURL extension?

Guzzle for readability and exception types. Raw cURL works in constrained hosting but is harder to maintain. OmniScrape integration looks the same either way.

DomCrawler or DiDOM?

DomCrawler if you already use Symfony components. DiDOM is lighter for standalone scripts. Both parse static HTML only.

Can I scrape from WordPress wp-cron?

Yes for small jobs, but wp-cron is unreliable on low-traffic sites. Use system cron calling php artisan or a standalone script instead.

How do I scrape logged-in pages?

Public pages behind bot walls use Web Unlocker. Pages behind your own login need Browser-as-a-Service with a scripted flow. Do not scrape data you are not authorized to access.

Why use css_extractor instead of DomCrawler?

Less PHP code and fewer places for layout changes to break silently. Keep DomCrawler when you need complex table traversal or archiving full HTML.
