1. Composer setup
Install Guzzle for HTTP and symfony/dom-crawler plus symfony/css-selector for parsing. You do not need the full Symfony kernel for a scrape script.
composer require guzzlehttp/guzzle
composer require symfony/dom-crawler symfony/css-selector
2. Fetch with Guzzle
Guzzle wraps cURL with a sane API. Set timeouts explicitly — PHP scripts on cron have no supervisor restarting hung processes.
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client(['timeout' => 30]);
$response = $client->get('https://books.toscrape.com/catalogue/page-1.html');
$html = (string) $response->getBody();
file_put_contents('page.html', $html);
echo 'Saved ' . strlen($html) . " bytes\n";
3. Parse with DomCrawler
DomCrawler filters nodes with CSS selectors via symfony/css-selector. filter() returns a new crawler scoped to matches; each() walks them.
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$books = [];

$crawler->filter('article.product_pod')->each(function (Crawler $card) use (&$books) {
    $books[] = [
        'title' => $card->filter('h3 a')->attr('title'),
        'price' => $card->filter('.price_color')->text(''),
        'in_stock' => str_contains($card->filter('.instock')->text(''), 'In stock'),
    ];
});

echo count($books) . " books found\n";
print_r(array_slice($books, 0, 2));
4. Pagination in a cron-friendly loop
Chunk work across cron ticks if you scrape large catalogs on shared hosting — store the last page in a database row and resume next run. For open demo sites, a simple while loop suffices.
$all = [];
$page = 1;

while (true) {
    $url = "https://books.toscrape.com/catalogue/page-{$page}.html";

    try {
        $res = $client->get($url);
    } catch (\GuzzleHttp\Exception\ClientException $e) {
        // A 404 means we ran past the last page; anything else is a real error.
        if ($e->getResponse()->getStatusCode() === 404) break;
        throw $e;
    }

    $c = new Crawler((string) $res->getBody());
    $cards = $c->filter('article.product_pod');
    if ($cards->count() === 0) break;

    $cards->each(function (Crawler $card) use (&$all) {
        $all[] = [
            'title' => $card->filter('h3 a')->attr('title'),
            'price' => $card->filter('.price_color')->text(''),
        ];
    });

    echo "Page {$page}: " . count($all) . " total\n";
    $page++;
    sleep(2);
}
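For the chunked variant described at the start of this section, a minimal sketch of the checkpoint logic might look like the following. It is an illustration under stated assumptions: a JSON file stands in for the database row, and the helper names loadCheckpoint, saveCheckpoint, and pagesForThisRun are hypothetical.

```php
<?php
// Hypothetical helpers: a JSON file stands in for the database row.

/** Return the page to start from: the saved page, or 1 when no checkpoint exists. */
function loadCheckpoint(string $path): int
{
    if (!is_file($path)) {
        return 1;
    }
    $state = json_decode((string) file_get_contents($path), true);
    $page = is_array($state) ? (int) ($state['next_page'] ?? 1) : 1;

    return max(1, $page);
}

/** Persist the next page to fetch. Writes a temp file, then renames it over the old one. */
function saveCheckpoint(string $path, int $nextPage): void
{
    $tmp = $path . '.tmp';
    file_put_contents($tmp, json_encode(['next_page' => $nextPage]), LOCK_EX);
    rename($tmp, $path);
}

/** Pages to process in one cron tick: a fixed-size chunk starting at the checkpoint. */
function pagesForThisRun(int $startPage, int $chunkSize): array
{
    $chunkSize = max(1, $chunkSize);

    return range($startPage, $startPage + $chunkSize - 1);
}
```

In the loop above, one option is to iterate over pagesForThisRun(loadCheckpoint($path), 5) instead of while (true), call saveCheckpoint($path, $page + 1) after each page is stored, and reset the checkpoint to 1 once a 404 or an empty page signals the end of the catalog.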
5. When Guzzle returns challenge pages
A 200 response with "Checking your browser" in the body is worse than a 403 — your parser runs happily and saves garbage. Detect challenge markers early or route protected domains straight to OmniScrape.
See the Cloudflare bypass guide for why header tweaks stop working on production retailers.
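Detecting a challenge page early can be as simple as a substring check before the HTML reaches DomCrawler. The sketch below is a heuristic, not a complete detector: only "Checking your browser" comes from the text above, the other markers are assumptions to tune per target site, and the function name looksLikeChallenge is hypothetical.

```php
<?php
/**
 * Heuristic check for an anti-bot interstitial served with a 200 status.
 * Only "checking your browser" is taken from the article; the remaining
 * markers are assumed examples and should be adjusted per site.
 */
function looksLikeChallenge(string $html): bool
{
    $markers = [
        'checking your browser',
        'just a moment',         // assumed marker
        'verify you are human',  // assumed marker
    ];

    $haystack = strtolower($html);
    foreach ($markers as $marker) {
        if (str_contains($haystack, $marker)) {
            return true;
        }
    }

    return false;
}
```

Call it right after reading the response body and throw, or reroute the URL, when it returns true. Because it matches plain substrings, a legitimate page that happens to contain one of these phrases would be flagged too, so keep the marker list short and specific.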
6. Guzzle + OmniScrape
POST JSON to https://api.omniscrape.io/v1/scrape and pass the API key in the X-API-Key header. Feed data.content into DomCrawler; your selectors stay unchanged.
$apiKey = getenv('OMNISCRAPE_KEY');

$response = $client->post('https://api.omniscrape.io/v1/scrape', [
    'headers' => ['X-API-Key' => $apiKey],
    'json' => [
        'url' => 'https://protected-shop.com/product/9912',
        'mode' => 'auto',
        'output_format' => 'html',
    ],
    'timeout' => 120,
]);

// Cast the stream to a string before decoding; json_decode() expects a string.
$body = json_decode((string) $response->getBody(), true);

// empty() also covers a body that failed to decode (null).
if (empty($body['success'])) {
    throw new RuntimeException('Scrape failed: ' . json_encode($body));
}

$html = $body['data']['content'];
$crawler = new Crawler($html);
$price = $crawler->filter('.product-price')->text('NOT FOUND');

echo "Price: {$price}\n";
echo 'Method: ' . $body['metadata']['method_used']
    . ', cost: $' . $body['billing']['charged'] . "\n";
7. Laravel Artisan command pattern
Wrap the Guzzle call in a command and schedule it in routes/console.php. Inject the HTTP client via the container; store results with Eloquent bulk inserts instead of one save() per row.
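// app/Console/Commands/ScrapeCatalog.php
public function handle(Client $client): int
{
    $response = $client->post('https://api.omniscrape.io/v1/scrape', [
        'headers' => ['X-API-Key' => config('services.omniscrape.key')],
        'json' => [
            'url' => $this->argument('url'),
            'mode' => 'auto',
            'output_format' => 'css_extractor',
            'css_selectors' => ['title' => 'h1', 'price' => '.price'],
        ],
        'timeout' => 120,
    ]);

    $body = json_decode((string) $response->getBody(), true);
    $data = $body['data']['css_extracted'] ?? [];

    // Nothing extracted: report failure instead of upserting an empty row.
    if ($data === []) {
        $this->error('No fields extracted for ' . $this->argument('url'));
        return self::FAILURE;
    }

    // upsert() matches rows on 'sku', so $data must also carry a sku value;
    // the selectors above extract only title and price.
    Product::upsert([$data], ['sku'], ['title', 'price']);

    return self::SUCCESS;
}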
8. SPAs and js_rendering
DomCrawler sees whatever HTML arrives. SPAs that load prices via fetch() need 'mode' => 'js_rendering'. Read the guide on scraping JavaScript-rendered pages before burning credits on empty shells.
'json' => [
    'url' => 'https://spa-store.com/listing',
    'mode' => 'js_rendering',
    'output_format' => 'html',
    'js_wait_selector' => '.product-card',
    'js_wait_timeout' => 12000,
],
9. PHP-specific pitfalls
A few issues bite PHP scrapers more often than other stacks:
- Never commit .env with OMNISCRAPE_KEY — use getenv() or Laravel config
- Use LONGTEXT or S3 for raw HTML, not VARCHAR(255)
- Increase max_execution_time only for CLI; web requests should not scrape
- Validate non-empty extracted fields before INSERT — silent nulls poison analytics
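The last point can be sketched as a small filter that runs before any INSERT. The text('') default used in the DomCrawler examples yields an empty string when a selector matches nothing, so blank strings are treated as missing here. The function names missingFields and partitionRows are hypothetical, and the required-field list is whatever your schema needs.

```php
<?php
/**
 * Return the names of required fields that are absent, null, or blank.
 * An empty result means the row is safe to insert.
 */
function missingFields(array $row, array $required): array
{
    $missing = [];
    foreach ($required as $field) {
        $value = $row[$field] ?? null;
        if ($value === null || (is_string($value) && trim($value) === '')) {
            $missing[] = $field;
        }
    }

    return $missing;
}

/** Split scraped rows into insertable rows and rejects; each reject records what was missing. */
function partitionRows(array $rows, array $required): array
{
    $valid = [];
    $rejected = [];
    foreach ($rows as $row) {
        $missing = missingFields($row, $required);
        if ($missing === []) {
            $valid[] = $row;
        } else {
            $rejected[] = ['row' => $row, 'missing' => $missing];
        }
    }

    return [$valid, $rejected];
}
```

A typical call is [$valid, $rejected] = partitionRows($books, ['title', 'price']); then insert $valid and log $rejected. Boolean false values such as in_stock pass the check, since only null and blank strings count as missing.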
10. Handle API responses
Check Guzzle status and JSON success separately:
- 401 — bad key; fix .env, stop cron until resolved
- 402 — out of balance; email ops, pause schedule
- 429 — sleep and retry with backoff
- 502 — retry up to 3 times
- success:false — log URL, skip retry loop
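The list above can be sketched as a small decision function plus a backoff helper. This is an illustration, not part of any client library: the action labels 'ok', 'retry', 'skip', and 'halt', the function names, and the 60-second cap are all assumptions.

```php
<?php
/**
 * Map an HTTP status and the decoded JSON body to a hypothetical action label:
 * 'ok', 'retry', 'skip', or 'halt'. $attempt is the number of retries already made.
 */
function classifyResponse(int $status, ?array $body, int $attempt): string
{
    if ($status === 401 || $status === 402) {
        return 'halt';                           // bad key or no balance: stop the schedule
    }
    if ($status === 429) {
        return 'retry';                          // rate limited: back off, then retry
    }
    if ($status === 502) {
        return $attempt < 3 ? 'retry' : 'skip';  // retry up to 3 times
    }
    if ($status >= 200 && $status < 300 && ($body['success'] ?? false) === true) {
        return 'ok';
    }

    return 'skip';                               // success:false or unexpected: log URL, move on
}

/** Exponential backoff in seconds (2, 4, 8, ...), capped at $max. */
function backoffSeconds(int $attempt, int $max = 60): int
{
    return min($max, 2 ** max(1, $attempt));
}
```

Guzzle throws on 4xx and 5xx responses by default, so to inspect status codes this way either pass 'http_errors' => false in the request options or read the status from the caught exception's response. You may also want an upper bound on 429 retries, which this sketch leaves open.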
Frequently asked questions
Guzzle or PHP cURL extension?
Guzzle for readability and exception types. Raw cURL works in constrained hosting but is harder to maintain. OmniScrape integration looks the same either way.
DomCrawler or DiDOM?
DomCrawler if you already use Symfony components. DiDOM is lighter for standalone scripts. Both parse static HTML only.
Can I scrape from WordPress wp-cron?
Yes for small jobs, but wp-cron is unreliable on low-traffic sites. Use system cron calling php artisan or a standalone script instead.
How do I scrape logged-in pages?
For public pages behind bot walls, use Web Unlocker. Pages behind your own login need Browser-as-a-Service with a scripted flow. Do not scrape data you are not authorized to access.
Why use css_extractor instead of DomCrawler?
Less PHP code and fewer places for layout changes to break silently. Keep DomCrawler when you need complex table traversal or archiving full HTML.