E-commerce Web Scraping: Catalog Intelligence at Production Scale

1. Production workflow: catalog refresh cycle

Monday 06:00 UTC — category crawlers pull PDP URLs from XML sitemaps and from the previous week's crawl output, deduplicating by canonical URL. Workers call the OmniScrape API with retailer-specific css_selectors profiles stored in a config registry. A validator rejects any row missing price, SKU, or in-stock status and routes it to a dead-letter queue. The diff engine compares each row against yesterday's snapshot and posts only moves exceeding 3% on hero SKUs to the pricing Slack channel — noise suppression is as important as coverage.

Friday — harder retailers (flash-sale WAF tightening, new bot challenge deployments) get a manual review of metadata.method_used across the week's runs. If the js_rendering ratio for a given retailer jumps from 12% to 40%, that signals a layout or challenge change that needs selector or proxy-country tuning before peak traffic events like Black Friday or Prime Day. Catching this on Friday gives the team a weekend buffer to fix selectors without impacting Monday's full catalog run.

This two-cadence model — automated daily runs plus weekly human review of method telemetry — is what separates stable price intelligence operations from brittle one-off scrapers that break silently.
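The validator and diff steps from the Monday run can be sketched roughly as follows. This is a minimal illustration rather than the production implementation: the function names, the row layout (borrowed from the warehouse schema), and the choice to compare absolute relative change against the 3% threshold are all assumptions.

```python
from typing import Optional

# Fields the validator requires; names follow the warehouse row layout (assumed).
REQUIRED_FIELDS = ("price_cents", "sku", "in_stock")
HERO_ALERT_THRESHOLD = 0.03  # the 3% move mentioned in the workflow


def is_valid_row(row: dict) -> bool:
    """Return False for rows missing price, SKU, or in-stock status."""
    return all(row.get(f) is not None and row.get(f) != "" for f in REQUIRED_FIELDS)


def price_move(previous_cents: int, current_cents: int) -> Optional[float]:
    """Relative change versus the previous snapshot, or None if undefined."""
    if previous_cents <= 0:
        return None
    return (current_cents - previous_cents) / previous_cents


def should_alert(row: dict, previous: Optional[dict], hero_skus: set) -> bool:
    """Alert only for hero SKUs whose price moved by more than the threshold."""
    if previous is None or row["sku"] not in hero_skus:
        return False
    move = price_move(previous["price_cents"], row["price_cents"])
    return move is not None and abs(move) > HERO_ALERT_THRESHOLD
```

Rows that fail is_valid_row would go to the dead-letter queue; only rows for which should_alert returns True would be posted to the pricing channel.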

2. Warehouse schema design for price history

Store one immutable row per SKU per scrape timestamp. Use a composite natural key of retailer_id + sku to identify a product, with scraped_at added to make each row unique. Never deduplicate by title text, which retailers change freely for SEO reasons. Keep price in integer cents to avoid floating-point comparison bugs in diff queries. Archive scrape_mode and scrape_cost_usd on every row so you can attribute infrastructure cost to specific retailers and justify budget to stakeholders.

The schema below maps directly to a BigQuery or Postgres append-only table. Partition by scraped_at date for efficient range scans in dbt diff models.

Warehouse row — BigQuery / Postgres

json

{
  "retailer_id": "ret_us_electronics_01",
  "sku": "WH-8842-XL",
  "url": "https://competitor.com/p/wh-8842-xl",
  "title": "Wireless Headphones Pro",
  "price_cents": 7999,
  "was_price_cents": 9999,
  "currency": "USD",
  "in_stock": true,
  "stock_label": "In Stock",
  "rating": 4.6,
  "review_count": 1284,
  "scraped_at": "2026-06-23T06:14:22Z",
  "scrape_mode": "fast",
  "solver_used": false,
  "scrape_cost_usd": 0.0035
}
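Scraped prices arrive as display strings, so something has to turn them into the integer price_cents shown above without passing through a float. The helper below is one possible sketch. The name to_cents is hypothetical, and it assumes that a separator followed by exactly one or two trailing digits is the decimal separator, while any other "." or "," is a thousands separator.

```python
import re
from typing import Optional

_NUMBER = re.compile(r"\d[\d.,]*")


def to_cents(raw: Optional[str]) -> Optional[int]:
    """Convert a display price such as '$1,299.99' or '1.299,00 €' to integer cents.

    Returns None when no number is present (for example 'Sold out').
    """
    if not raw:
        return None
    match = _NUMBER.search(raw)
    if match is None:
        return None
    number = match.group().rstrip(".,")
    # A separator followed by 2 (or 1) trailing digits is treated as the decimal point.
    for digits in (2, 1):
        if len(number) > digits + 1 and number[-(digits + 1)] in ".,":
            whole = number[: -(digits + 1)]
            frac = number[-digits:].ljust(2, "0")
            break
    else:
        whole, frac = number, "00"
    whole = re.sub(r"[.,]", "", whole)  # drop thousands separators
    return int(whole) * 100 + int(frac)
```

Working on strings and integers only means two observations of the same price always compare equal in diff queries. Currencies without two minor-unit digits would need their own handling.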

3. OmniScrape API request for PDP extraction

Use mode: auto as the default for all retailers. Auto tries the fast HTTP lane first and escalates to a headless browser only when the response signals a bot challenge or the price selector comes back empty. This keeps costs low for simple Shopify storefronts while handling Magento + Cloudflare stacks automatically without per-retailer mode overrides in your config.

Set js_wait_selector to the price element's CSS selector when you know the retailer lazy-loads pricing. js_wait_timeout of 8000ms covers most React hydration cycles; increase to 12000ms for retailers with slow CDN edge caching. Match the proxy country to the storefront currency region you are pricing against — a US proxy on a DE storefront will return EUR prices but may surface different stock levels or promotional pricing than a DE residential IP.

The css_selectors map is evaluated server-side by OmniScrape and returned in body.data.css_extracted, so your worker receives clean key-value pairs rather than raw HTML to parse.

PDP extraction request

json

POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_KEY
Content-Type: application/json

{
  "url": "https://competitor.com/p/wh-8842-xl",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "sku": "[data-product-sku]",
    "title": "h1.product-title",
    "price": "[itemprop='price']",
    "was_price": "[class*='was-price']",
    "in_stock": ".availability",
    "rating": "[itemprop='ratingValue']",
    "review_count": "[itemprop='reviewCount']"
  },
  "js_wait_selector": "[itemprop='price']",
  "js_wait_timeout": 8000
}
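A worker might build this request and read the extracted fields along the following lines. This sketch uses only the Python standard library and does not send anything by itself. The helper names are hypothetical, and the response layout (a top-level success flag plus data.css_extracted) is an assumption that reads "body" in body.data.css_extracted as the parsed JSON response body.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.omniscrape.io/v1/scrape"  # endpoint from the request above


def build_request(url: str, selectors: dict, api_key: str,
                  proxy: str = "residential:us",
                  wait_selector: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a PDP extraction request."""
    payload = {
        "url": url,
        "mode": "auto",
        "output_format": "css_extractor",
        "proxy": proxy,
        "enable_solver": True,
        "css_selectors": selectors,
    }
    if wait_selector:
        payload["js_wait_selector"] = wait_selector
        payload["js_wait_timeout"] = 8000
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def extract_fields(response: dict) -> dict:
    """Return the css_extracted map, or {} on failure (response shape assumed)."""
    if not response.get("success"):
        return {}
    return (response.get("data") or {}).get("css_extracted") or {}
```

To send the request, pass the object to urllib.request.urlopen with a timeout and decode the result with json.load before handing it to extract_fields.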

4. Pipeline architecture: queue to warehouse

Sitemap ingest → URL normalisation and dedup → URL queue (SQS or Redis ZSET with priority score) → scrape workers (POST /v1/scrape, concurrency capped at 5 in-flight per retailer domain) → response validator (rejects missing price or SKU) → raw HTML archive (S3, 7-day TTL, keyed by url hash + timestamp) → structured rows written to Postgres → nightly dbt diff model computes price delta vs previous row → alert fanout (PagerDuty for >10% hero SKU moves, Slack digest for catalog-wide summary) → pricing dashboard (Looker or Metabase).
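The per-domain concurrency cap in the scrape-worker stage is one of the easier parts to get wrong. One way to enforce it inside a single worker process is a semaphore per domain, sketched below with asyncio. DomainLimiter and crawl are hypothetical names, and scrape stands in for whatever coroutine actually calls POST /v1/scrape. A fleet of separate worker processes would need a shared limiter instead.

```python
import asyncio
from urllib.parse import urlsplit

MAX_IN_FLIGHT_PER_DOMAIN = 5  # cap from the pipeline description


class DomainLimiter:
    """Caps concurrent scrapes per retailer domain within one process."""

    def __init__(self, limit: int = MAX_IN_FLIGHT_PER_DOMAIN):
        self._limit = limit
        self._semaphores = {}

    def _semaphore(self, url: str) -> asyncio.Semaphore:
        domain = urlsplit(url).netloc.lower()
        if domain not in self._semaphores:
            self._semaphores[domain] = asyncio.Semaphore(self._limit)
        return self._semaphores[domain]

    async def run(self, url: str, scrape):
        # Wait for a free slot on this domain, then run the scrape coroutine.
        async with self._semaphore(url):
            return await scrape(url)


async def crawl(urls, scrape, limit: int = MAX_IN_FLIGHT_PER_DOMAIN):
    """Scrape all URLs, never exceeding `limit` in-flight calls per domain."""
    limiter = DomainLimiter(limit)
    return await asyncio.gather(*(limiter.run(u, scrape) for u in urls))
```

Because the limit is keyed by domain, a slow retailer only throttles its own URLs and does not hold back the rest of the queue.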

Dead-letter queue captures two failure classes: explicit success:false responses (bot block, timeout, 5xx) and silent failures where success is true but the price field is null or empty. Silent failures are more dangerous because they do not trigger block-rate alerts — a retailer layout change can silently zero out prices for hours before anyone notices. Replay DLQ URLs after selector fixes without re-running the full catalog; this also lets analysts test new css_selectors against known-failing URLs before promoting to production config.
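The two failure classes can be separated with a small classifier before rows reach the warehouse. The sketch below is illustrative: Outcome, classify, and route are hypothetical names, the in-memory list stands in for a real dead-letter queue, and the response shape (success plus data.css_extracted.price) is assumed.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    EXPLICIT_FAILURE = "explicit_failure"  # success:false (block, timeout, 5xx)
    SILENT_FAILURE = "silent_failure"      # success:true but price null or empty


def classify(response: dict) -> Outcome:
    """Classify a scrape response into one of the outcome classes."""
    if not response.get("success"):
        return Outcome.EXPLICIT_FAILURE
    extracted = (response.get("data") or {}).get("css_extracted") or {}
    price = extracted.get("price")
    if price is None or not str(price).strip():
        return Outcome.SILENT_FAILURE
    return Outcome.OK


def route(url: str, response: dict, dlq: list) -> Outcome:
    """Append failures to the dead-letter list, tagged with the failure class."""
    outcome = classify(response)
    if outcome is not Outcome.OK:
        dlq.append({"url": url, "reason": outcome.value})
    return outcome
```

Tagging each DLQ item with its reason makes it straightforward to count silent failures separately from blocks and to replay only the URLs affected by a selector fix.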

For the raw HTML archive: storing the full body.data.content response on S3 is cheap insurance. When a retailer redesigns their PDP, you can diff the archived HTML from the last successful scrape against today's failure to identify exactly which DOM nodes moved, without making additional API calls.
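As a rough illustration of the archive-and-diff idea, the sketch below builds an archive key from a URL hash and timestamp and lists the lines that changed between two HTML snapshots using difflib from the standard library. The key layout and function names are assumptions. A line-based diff is only useful on pretty-printed markup, so minified pages would need reformatting first.

```python
import difflib
import hashlib
from datetime import datetime, timezone


def archive_key(url: str, scraped_at: datetime) -> str:
    """Object key built from a URL hash and a UTC timestamp (layout assumed)."""
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    stamp = scraped_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"raw-html/{url_hash}/{stamp}.html"


def html_changes(last_good: str, today: str) -> list:
    """Return removed ('-') and added ('+') lines between two HTML snapshots."""
    diff = list(difflib.unified_diff(
        last_good.splitlines(), today.splitlines(),
        fromfile="last_good", tofile="today", lineterm="", n=0,
    ))
    # The first two entries are the file headers; hunk markers start with '@@'.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Searching the returned lines for the old price selector usually shows quickly whether the node was renamed or removed altogether.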

5. Variant matrices and URL explosion

Color and size variants can multiply a 50,000 SKU catalog into 500,000 URLs if you crawl every swatch combination. Before doing that, check whether the canonical PDP URL exposes all variant prices in a single JSON-LD block or a JavaScript data layer. Many Shopify and WooCommerce stores embed the full variant price matrix in a window.__INITIAL_STATE__ or application/ld+json script tag visible in the initial HTML response — extractable without per-variant requests.
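A quick way to test whether a PDP already exposes its variant prices is to parse the application/ld+json blocks out of the initial HTML. The sketch below uses only the standard library. It assumes a schema.org Product object whose offers carry sku and price keys, which is common but not guaranteed, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects parsed JSON documents from application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def variant_prices(html: str) -> dict:
    """Map offer SKU to price string for every Product found in JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    prices = {}
    for doc in collector.documents:
        for item in doc if isinstance(doc, list) else [doc]:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers") or []
            for offer in [offers] if isinstance(offers, dict) else offers:
                if isinstance(offer, dict) and "sku" in offer and "price" in offer:
                    prices[offer["sku"]] = str(offer["price"])
    return prices
```

If this returns one entry per variant for a sample of PDPs, a single request per parent product is enough and the per-swatch URLs can be dropped from the queue.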

When variant prices differ materially (e.g., XL commands a $20 premium), track each variant as a separate row with a variant_id field appended to the natural key. When prices are uniform across variants and only availability differs, a single parent SKU row with a variant_availability JSON column is sufficient and far cheaper to maintain.

For retailers that render variant prices only after a swatch click (pure client-side state), use mode: js_rendering with a session_id to simulate the click sequence, or consider whether the variant price data is available in the network requests captured by the headless browser — some retailers expose an internal pricing API endpoint that is more stable than the DOM.

6. Geo-specific storefronts and regional pricing

A US and a DE storefront for the same retailer often differ in SKU availability, promotional pricing, currency, and even which products are listed. If your pricing strategy covers multiple markets, treat each geo as a separate retailer_id dimension in your schema — do not merge US and DE rows under the same key or your diff models will produce false price-change alerts on currency fluctuations.

Pin proxy: residential:de for EU storefronts and proxy: residential:us for North American ones. Some retailers serve different prices based on IP geolocation alone, even on a single global domain — a residential IP in the target market is the only reliable way to see the price a local consumer sees.

Log the proxy_country field on every warehouse row so you can filter dashboards by market and audit which geo a price observation came from. This also helps when a retailer adds geo-blocking mid-campaign: the block rate metric will spike for a specific country dimension rather than globally, making the root cause obvious.

7. Operational metrics and health monitoring

Alert when silent failure rate exceeds 0.5% for any single retailer. A layout change that breaks css_selectors will poison your pricing models before block rate moves at all — silent failures are the leading indicator of a selector rot problem, not block rate.

Track js_rendering ratio as a weekly trend rather than a point-in-time number. A sustained increase means a retailer has added JavaScript rendering to pages that previously served prices in static HTML. Catching this early lets you update selectors and adjust budget allocation before the ratio reaches a level that significantly impacts cost.

Catalog coverage % — SKUs successfully scraped with valid price / total SKUs expected in the run
Price change detection latency — median hours between a competitor price move and your first observation of it
Block rate by retailer — success:false responses / total attempts, tracked as a daily time series
Cost per million SKU refreshes — sum of billing.charged across all scrape calls in the run / SKUs refreshed in the run × 1,000,000
Silent failure rate — rows where success:true but price field is null or empty / total success:true rows
js_rendering ratio per retailer — calls where metadata.method_used === 'js_rendering' / total calls, tracked weekly
DLQ depth by retailer — count of unprocessed dead-letter items, alert if growing across consecutive runs
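Three of the ratios above can be computed from a flat list of per-call records, as in the sketch below. The record layout (success, price, method_used) is an assumption about what a worker would log for each call, and the function names are hypothetical.

```python
def retailer_metrics(calls: list) -> dict:
    """Compute block rate, silent failure rate, and js_rendering ratio.

    Each call is a dict with 'success' (bool), 'price' (str or None), and
    'method_used' (str). This layout is assumed for illustration.
    """
    total = len(calls)
    succeeded = [c for c in calls if c.get("success")]
    silent = [c for c in succeeded if not str(c.get("price") or "").strip()]
    js_rendered = [c for c in calls if c.get("method_used") == "js_rendering"]
    return {
        "block_rate": (total - len(succeeded)) / total if total else 0.0,
        "silent_failure_rate": len(silent) / len(succeeded) if succeeded else 0.0,
        "js_rendering_ratio": len(js_rendered) / total if total else 0.0,
    }


def silent_failure_alert(metrics: dict, threshold: float = 0.005) -> bool:
    """True when the silent failure rate exceeds the 0.5% alert level."""
    return metrics["silent_failure_rate"] > threshold
```

Note that the silent failure rate uses successful calls as its denominator, matching the definition in the list, so a burst of blocks does not dilute it.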

8. Sale events, flash sales, and anti-bot spikes

During major sale events — Black Friday, Cyber Monday, Prime Day equivalents — retailers tighten WAF rules and reduce rate limits because legitimate traffic is high and they have cover to block aggressive crawlers. This is exactly when your pricing team needs the most accurate data, which creates a direct operational conflict.

Prepare at least 48 hours in advance: set enable_solver: true across all retailer configs, lower per-domain concurrency from 5 to 3 in-flight, and pre-warm residential sessions with homepage and category page fetches before hitting PDPs. Some retailers bind session trust across navigation, so a cold session landing directly on a PDP triggers challenges that a warmed session avoids.

During the event, poll hero SKUs every 15–30 minutes rather than hourly. Use a separate worker pool with a dedicated budget cap for hero SKU polling so a spike in hero SKU costs does not starve the full catalog run. After the event, review metadata.solver_used ratios to understand which retailers required the most challenge-solving overhead — this informs proxy and concurrency tuning for the next event.
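The event-time changes described in this section amount to a temporary override of each retailer's config plus a fixed fetch order for session warm-up. A sketch of both follows. RetailerConfig and its fields are hypothetical stand-ins for whatever the config registry actually stores.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RetailerConfig:
    """Hypothetical shape of one entry in the retailer config registry."""
    retailer_id: str
    max_in_flight: int = 5
    enable_solver: bool = False
    hero_poll_minutes: int = 60


def sale_event_profile(config: RetailerConfig, hero_poll_minutes: int = 15) -> RetailerConfig:
    """Return a copy tuned for a sale event; the original stays unchanged."""
    return replace(
        config,
        max_in_flight=min(config.max_in_flight, 3),  # lower 5 -> 3, never raise
        enable_solver=True,
        hero_poll_minutes=hero_poll_minutes,
    )


def warmed_sequence(homepage: str, category: str, pdp_urls: list) -> list:
    """Fetch order for one session: homepage, category page, then PDPs."""
    return [homepage, category, *pdp_urls]
```

Keeping the event profile as a derived copy means the normal config can be restored after the event simply by dropping the override.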

9. Compliance and data governance

Scrape only publicly accessible PDP data that your legal team has reviewed and approved for your use case. Pricing visible to any anonymous visitor is generally treated as public information, but the legal landscape varies by jurisdiction and by each site's terms of service, so get explicit sign-off before launching a new retailer.

Respect robots.txt directives where your legal agreements or internal policy require it. Do not attempt to bypass login walls, paywalls, or wholesale portal authentication to access pricing data that is not publicly visible — this crosses into unauthorized access regardless of technical feasibility.

Implement data retention policies that match your business need. Storing 90 days of price history is typically sufficient for pricing model training; storing raw HTML archives beyond 7–14 days is rarely justified and increases storage costs and legal surface area. Document your retention schedule and enforce it with automated TTL policies on S3 and partition expiry on your warehouse tables.

10. Phased rollout: from pilot to full catalog

Phase 1 — Pilot (weeks 1–3): Select 500 hero SKUs across your top 3 competitor retailers. Focus on getting the schema, pipeline, and alerting right before scaling. Manually review every DLQ item. Measure silent failure rate and block rate daily. Validate that diff alerts are actionable before expanding coverage.

Phase 2 — Full catalog nightly (weeks 4–8): Expand to the full SKU catalog for each retailer on a nightly cadence. Introduce the dbt diff model and Looker dashboard. Automate DLQ replay after selector fixes. Set budget caps per retailer and alert if a single retailer exceeds 15% of total monthly spend — this catches runaway js_rendering escalation early.
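The per-retailer budget check from Phase 2 reduces to a share-of-total comparison. A minimal sketch, assuming month-to-date spend is already aggregated per retailer_id (the function name and input shape are hypothetical):

```python
def over_budget_share(spend_by_retailer: dict, max_share: float = 0.15) -> list:
    """Return retailer IDs whose share of total spend exceeds max_share."""
    total = sum(spend_by_retailer.values())
    if total <= 0:
        return []
    return sorted(
        retailer for retailer, spend in spend_by_retailer.items()
        if spend / total > max_share
    )
```

A 15% share threshold is only informative once spend is spread across at least seven retailers. With six or fewer, the average share is already above 15%, so at least one retailer will always be flagged.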

Phase 3 — Hourly hero SKU polling (weeks 9+): Spin up a dedicated worker pool for hero SKUs with a higher polling frequency (every 30–60 minutes). Separate this pool's budget from the nightly full-catalog run so the two workloads do not compete. At this stage, integrate pricing signals directly into your repricing engine rather than routing through a human Slack review step — the latency savings are the primary ROI of the hourly cadence.

Frequently asked questions

Should every PDP request use js_rendering mode?

No — start with mode: auto for all retailers. Auto tries the fast HTTP lane first and escalates to a headless browser only when needed. Force js_rendering explicitly only when you have confirmed via metadata.method_used that auto is consistently escalating anyway, or when you need precise control over js_wait_selector timing. Defaulting everything to js_rendering roughly triples cost with no accuracy benefit on static or server-rendered storefronts.

How do I handle MAP pricing that is hidden behind a login?

Do not scrape unauthorized wholesale or dealer portals. Public MAP-visible prices on consumer-facing PDPs are the correct target. If your pricing strategy requires wholesale MAP data, use licensed data feeds from the brand or a data provider with explicit authorization — not a scraper pointed at a gated portal.

What concurrency is safe per retailer domain?

Start at 3–5 in-flight requests per domain and hold that level for at least one full week before increasing. Watch block rate and js_rendering ratio daily. If both stay flat, increase by 2 and repeat. There is no universal safe number — it depends on the retailer's WAF configuration, your proxy pool size, and whether you are using session_id to distribute requests across persistent sessions.
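The ramp-up rule can be written down as a small function. This is a sketch under stated assumptions: "flat" is interpreted as a max-minus-min spread within a tolerance over at least seven daily readings, and both the tolerance and the ceiling are illustrative values that do not come from the text above.

```python
def next_concurrency(current: int, block_rates: list, js_ratios: list,
                     tolerance: float = 0.01, step: int = 2, ceiling: int = 20) -> int:
    """Raise per-domain concurrency by `step` only after a flat full week.

    block_rates and js_ratios are daily values (0.0 to 1.0), oldest first.
    tolerance and ceiling are assumed values for illustration.
    """
    if len(block_rates) < 7 or len(js_ratios) < 7:
        return current  # not enough history yet; hold the current level

    def flat(series: list) -> bool:
        week = series[-7:]
        return max(week) - min(week) <= tolerance

    if flat(block_rates) and flat(js_ratios):
        return min(current + step, ceiling)
    return current
```

Run it once per retailer per week. If either series is moving, the function holds the current level and leaves any decrease to a human decision.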

Can OmniScrape extract JSON-LD structured data from PDPs?

The css_extractor output_format works well for visible DOM fields. For JSON-LD embedded in script tags, request output_format: html and parse the application/ld+json block in your worker using a JSON-LD library. JSON-LD is often more stable than CSS selectors on React and Next.js storefronts because it is generated server-side for SEO and changes less frequently than the visual DOM structure. The raw HTML is in body.data.content.

How do I debug a retailer PDP redesign that broke my selectors?

Pull the archived S3 HTML (body.data.content) from the last successful scrape and compare it to today's failure using a diff tool. Identify which DOM nodes moved or were renamed. Update css_selectors in your retailer config registry, test against a sample of DLQ URLs before promoting, and replay the DLQ. Never update selectors in production config without testing against known-failing URLs first.

How do I track whether OmniScrape used a bot solver on a given request?

Check metadata.solver_used and metadata.challenge_solved in the response. Log both fields on your warehouse row alongside scrape_mode. Tracking solver_used as a daily ratio per retailer gives you early warning when a retailer has deployed a new bot challenge — the solver ratio will spike before block rate climbs, because the solver is handling challenges that would otherwise result in failures.

What is the right approach when a retailer starts returning success:true but with empty prices?

This is a silent failure — the most dangerous failure mode because it does not trigger block-rate alerts. The retailer has either changed the CSS selector for the price element, moved pricing behind a JavaScript interaction, or started serving a different page template to your IP range. First, pull the raw archived HTML and inspect whether the price element exists at all. If it does but under a different selector, update css_selectors. If the price is absent from the initial HTML entirely, switch to mode: js_rendering with js_wait_selector pointing to the price element.
