1. Industry workflow: tiered polling
Not every SKU deserves the same polling cadence, so the workflow tiers them by commercial importance. Tier A — hero products where a single margin point is worth real money — gets polled every two hours. Tier B catalog SKUs get a daily refresh. The long tail might run weekly or on-demand. Each poll sends a request to the OmniScrape API with css_selectors targeting raw price text, a validator normalizes that text to integer cents, a diff engine compares against the previous snapshot, and an alert fires only when the change clears a configurable threshold or the competitor becomes the cheapest in the tracked set.
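As a rough illustration of the tiering described above, the scheduler's "is this SKU due?" check can be sketched as follows. The tier labels and the function name are assumptions for illustration, and the weekly value for the long tail is only one reading of "weekly or on-demand".

```python
from datetime import datetime, timedelta, timezone

# Intervals taken from the text; tier "C" (long tail) is assumed to be weekly here.
TIER_INTERVALS = {
    "A": timedelta(hours=2),
    "B": timedelta(days=1),
    "C": timedelta(weeks=1),
}


def is_due(tier, last_polled_at, now):
    """Return True when a SKU was never polled or its tier interval has elapsed."""
    if last_polled_at is None:
        return True
    return now - last_polled_at >= TIER_INTERVALS[tier]


now = datetime(2026, 6, 23, 10, 0, tzinfo=timezone.utc)
three_hours_ago = now - timedelta(hours=3)
print(is_due("A", three_hours_ago, now), is_due("B", three_hours_ago, now))
```

Keeping the intervals in one table makes a later tier change a one-line edit rather than a scheduler rewrite.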
The alert is only half the job — the other half is defending it. When a sales rep insists 'their price was never that low', you pull the archived HTML from object storage, timestamped at the exact moment the alert fired, and show them the rendered page. That evidence trail converts price intelligence from a credibility argument into a settled fact, which is why the archive step is non-negotiable even though it adds storage cost. At its core the loop is a web scraping API call wrapped in scheduling and diff logic — the hard engineering is in the cadence, normalization, and evidence chain, not the HTTP fetch itself.
A dead-letter queue is equally important: capture every response where success is false or where the price selector returns empty. Without it, a selector break silently stops alerts on an entire competitor and nobody notices until a merchandising review weeks later. Treat empty-price extractions as data-quality events, not successful no-ops.
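A minimal sketch of that routing rule, assuming the response is a dict with a boolean `success` field and that extracted fields sit under a `data.extracted` key. The `extracted` key and the function name are hypothetical, so adjust them to the real response shape.

```python
def route_response(response, price_field="price_display"):
    """Return 'dead_letter' or 'parse' for one scrape response dict."""
    # Any unsuccessful response is captured rather than silently dropped.
    if not response.get("success", False):
        return "dead_letter"
    # Hypothetical location of the css_selectors results.
    extracted = (response.get("data") or {}).get("extracted") or {}
    price_text = extracted.get(price_field)
    # An empty price is a data-quality event, not a successful no-op.
    if price_text is None or not str(price_text).strip():
        return "dead_letter"
    return "parse"


ok = {"success": True, "data": {"extracted": {"price_display": "$24.99"}}}
empty = {"success": True, "data": {"extracted": {"price_display": "  "}}}
failed = {"success": False}
print(route_response(ok), route_response(empty), route_response(failed))
```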
2. Normalized price snapshot schema
Store price as integer cents, never as a float or a raw display string. Floating-point drift and embedded currency symbols both corrupt comparisons in ways that surface as phantom alerts weeks later. Keep the original price_display string alongside the parsed cents so a normalization bug is debuggable directly from the row. Log scrape_mode from metadata.method_used so finance can attribute infrastructure cost to the specific SKUs and competitors that force expensive browser-rendering escalations.
The map_violation_suspected flag should be set by a rule configured by counsel, not by the scraper itself — what counts as 'advertised' versus a logged-in or cart-revealed price varies by jurisdiction and by your manufacturer contracts. The schema's job is to capture evidence cleanly; the legal interpretation happens downstream.
{
"sku": "HERO-001",
"competitor": "retail_b",
"price_cents": 2499,
"currency": "USD",
"price_display": "$24.99",
"unit_count": 1,
"price_per_unit_cents": 2499,
"was_on_sale": true,
"map_violation_suspected": false,
"scraped_at": "2026-06-23T10:00:00Z",
"scrape_mode": "js_rendering",
"alert_fired": true,
"alert_reason": "price_drop_8pct",
"snapshot_s3_key": "snapshots/retail_b/HERO-001/2026-06-23T10:00:00Z.html"
}
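One way to enforce the integer-cents rule before a row is written is a small validation pass like the sketch below. The function name and the exact set of checks are assumptions for illustration, not part of the schema above.

```python
def snapshot_problems(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    cents = row.get("price_cents")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(cents, int) or isinstance(cents, bool):
        problems.append("price_cents must be an integer number of cents")
    # The raw display string is what makes a normalization bug debuggable.
    if not row.get("price_display"):
        problems.append("price_display is missing")
    if not isinstance(row.get("unit_count"), int) or row["unit_count"] < 1:
        problems.append("unit_count must be an integer >= 1")
    return problems


good_row = {"price_cents": 2499, "price_display": "$24.99", "unit_count": 1}
float_row = {"price_cents": 24.99, "price_display": "$24.99", "unit_count": 1}
print(snapshot_problems(good_row), snapshot_problems(float_row))
```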
3. OmniScrape API request for price extraction
The js_wait_selector pinned to .price-current is what prevents alerting on garbage during hydration. Many storefronts render a skeleton loader or a strikethrough was-price first, then swap in the live current price a few hundred milliseconds later. If you read the DOM too early you capture the placeholder and fire a false drop alert. Setting js_wait_timeout to 6000 ms gives the storefront enough time to settle without letting a slow page stall the entire polling queue.
Matching the proxy country to the storefront is equally important. A US residential IP and a DE residential IP often see different prices, currencies, and tax treatment on the same product URL. Always set proxy to the country where that competitor's storefront is legally priced, and store the proxy country in the snapshot row so analysts know which price basis they are looking at.
Use mode auto as the default. OmniScrape will attempt a fast HTTP fetch first and escalate to headless browser rendering automatically if the response is a bot challenge or an incomplete DOM. Set mode to js_rendering explicitly only when you know from prior runs that a specific storefront always requires it; this keeps both your fast-to-browser ratio and your billing predictable.
POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_KEY
Content-Type: application/json
{
"url": "https://competitor.com/p/hero-001",
"mode": "auto",
"output_format": "css_extractor",
"proxy": "residential:us",
"enable_solver": true,
"css_selectors": {
"price_display": ".price-current",
"was_on_sale": ".badge-sale",
"availability": ".stock-status",
"rating_count": ".review-count"
},
"js_wait_selector": ".price-current",
"js_wait_timeout": 6000
}
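The same request can be assembled from Python with only the standard library. This is a sketch that mirrors the fields shown above; the helper name `build_price_request` is an assumption, the selector set is trimmed to two fields for brevity, and actually sending the request is left to the caller.

```python
import json
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"


def build_price_request(product_url, proxy_country, api_key):
    """Build (but do not send) a price-extraction request."""
    payload = {
        "url": product_url,
        "mode": "auto",
        "output_format": "css_extractor",
        # Match the proxy country to the storefront being priced.
        "proxy": f"residential:{proxy_country}",
        "enable_solver": True,
        "css_selectors": {
            "price_display": ".price-current",
            "was_on_sale": ".badge-sale",
        },
        "js_wait_selector": ".price-current",
        "js_wait_timeout": 6000,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_price_request("https://competitor.com/p/hero-001", "us", "YOUR_KEY")
# To send: urllib.request.urlopen(req, timeout=30)
```

Building the request separately from sending it keeps the payload easy to unit-test without network access.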
4. End-to-end pipeline architecture
The flow starts from a SKU master table in Postgres that records each SKU's tier, assigned competitors, and last-polled timestamp. A priority scheduler reads that table and enqueues poll tasks, respecting per-domain concurrency limits to avoid hammering a single retailer's infrastructure. Workers dequeue tasks, call the OmniScrape API, and write raw responses — including the full HTML from data.content — to object storage before any parsing happens. Parsing after archiving means you always have the ground truth if normalization logic changes later.
Parsed rows flow into the diff engine, which compares price_cents against the most recent snapshot for the same SKU-competitor pair. If the relative change exceeds a threshold (typically 2–5% depending on category margin) or the competitor becomes cheapest in the tracked set, the alert router fans out to Slack and email with a link to the archived snapshot. Cleared alerts feed the pricing dashboard. Where the business allows automated repricing, a human-approved webhook fires downstream.
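The alert decision itself is small. The sketch below assumes a percentage threshold measured against the previous snapshot and a list of the other tracked competitors' current prices; the function name, the 3% default, and the reason strings are illustrative choices rather than a fixed format.

```python
def alert_reason(prev_cents, new_cents, other_competitor_cents, threshold_pct=3.0):
    """Return a reason string when an alert should fire, else None."""
    if prev_cents:
        # Percentage change relative to the previous snapshot.
        change_pct = (new_cents - prev_cents) / prev_cents * 100
        if abs(change_pct) >= threshold_pct:
            direction = "drop" if change_pct < 0 else "rise"
            return f"price_{direction}_{abs(change_pct):.0f}pct"
    if other_competitor_cents:
        floor = min(other_competitor_cents)
        # Fire only on the transition to cheapest, not on every later poll.
        if new_cents < floor and (prev_cents is None or prev_cents >= floor):
            return "became_cheapest"
    return None


print(alert_reason(2716, 2499, []))
print(alert_reason(2650, 2599, [2600]))
```

Checking for the transition to cheapest, rather than the state of being cheapest, avoids re-alerting every cycle on a competitor that has held the lowest price for days.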
Scheduling typically runs on Airflow or a lightweight cron-plus-Redis setup. The diff and alert logic lives in dbt models or plain SQL against the data warehouse, which keeps it auditable and version-controlled. Finance reviews the monthly sum of billing.charged against the tier value of the SKUs being monitored — that conversation decides whether the two-hour tier can expand or has to contract. Keep the billing join simple: one row per scrape request, keyed by sku and competitor, with charged and balance_after logged directly.
5. Price normalization: handling locales and edge cases
Price strings are a localization minefield. '$1,299.99' in the US, '1.299,99 €' in Germany, and '1 299,99' with a non-breaking space in France all represent the same magnitude with different separators. Parse them with a locale-aware library (Babel's parse_decimal in Python, for example, optionally paired with py-moneyed for the money type, or an equivalent locale-aware parser in JavaScript), keyed off the storefront locale, rather than rolling your own regex. The regex approach tends to mis-parse the thousands separator as a decimal point and report a competitor selling a laptop for a dollar and change, which fires an alert that destroys the team's confidence in the system overnight.
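To show what "keyed off the storefront locale" means in practice, here is a dependency-free sketch that uses an explicit separator table for the three locales mentioned above. The table, the currency list, and the function name are assumptions for illustration; a maintained locale library covers far more cases and is the better choice in production.

```python
from decimal import Decimal, InvalidOperation

# (thousands separators, decimal separator) per storefront locale.
LOCALE_SEPARATORS = {
    "en_US": ((",",), "."),
    "de_DE": ((".",), ","),
    "fr_FR": ((" ", "\u00a0", "\u202f"), ","),
}
CURRENCY_MARKS = ("$", "€", "£", "USD", "EUR", "GBP")


def parse_price_to_cents(text, locale):
    """Parse a display price into integer cents using the locale's separators."""
    thousands, decimal_sep = LOCALE_SEPARATORS[locale]
    cleaned = text
    for mark in CURRENCY_MARKS:
        cleaned = cleaned.replace(mark, "")
    cleaned = cleaned.strip()
    for sep in thousands:
        cleaned = cleaned.replace(sep, "")
    cleaned = cleaned.replace(decimal_sep, ".")
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        # Covers formats such as 'from $X' that are not a plain amount.
        raise ValueError(f"unparseable price {text!r} for locale {locale}")
    return int((amount * 100).to_integral_value())


print(parse_price_to_cents("$1,299.99", "en_US"))
print(parse_price_to_cents("1.299,99 €", "de_DE"))
print(parse_price_to_cents("1\u00a0299,99", "fr_FR"))
```

The separators are never guessed from the string itself; the same text parses differently under a different locale, which is the behavior a regex cannot provide.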
When normalization fails, flag the row and do not alert. A parse failure is a data-quality event, not a price drop. Track normalization_failure_rate as a first-class operational metric, because a spike almost always means a competitor changed their price markup, added a new currency, or introduced a new display format like 'from $X' for variable-price bundles. Catching that spike before it produces false alerts protects the merchandising team's trust in the system far more than any new feature.
Also handle the case where the CSS selector returns an empty string. An empty price is not zero — it means the product is out of stock, behind a login wall, or the selector broke. Distinguish these cases explicitly in your normalization layer: null price with reason 'selector_empty' is a different operational signal than null price with reason 'parse_failed'.
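A sketch of a normalization wrapper that keeps the two null reasons apart. It assumes the parser is any callable that returns integer cents or raises ValueError; `demo_parse` is a deliberately naive stand-in used only to exercise the wrapper, and `null_reason` is a hypothetical field name.

```python
def normalize_price(raw_text, parse):
    """Return price_cents plus an explicit reason whenever the price is null."""
    if raw_text is None or not raw_text.strip():
        # Out of stock, login wall, or a broken selector: never treat as zero.
        return {"price_cents": None, "null_reason": "selector_empty"}
    try:
        return {"price_cents": parse(raw_text), "null_reason": None}
    except ValueError:
        return {"price_cents": None, "null_reason": "parse_failed"}


def demo_parse(text):
    """Stand-in parser: handles only '$D.CC' style strings."""
    digits = text.replace("$", "").replace(".", "")
    if not digits.isdigit():
        raise ValueError(text)
    return int(digits)


print(normalize_price("$24.99", demo_parse))
print(normalize_price("", demo_parse))
print(normalize_price("from $24.99", demo_parse))
```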
6. Bundles, multi-packs, and per-unit comparison
Multi-pack and bundle URLs are the most common source of misleading 'great deal' alerts. A competitor's single unit at $25 looks like an undercut against your three-pack at $60 until you divide your pack out to $20 per unit and realize the competitor is actually at a premium. The fix is a mapping table that links each competitor SKU to your internal SKU along with a unit_count, so the alert logic always compares price_per_unit_cents rather than raw sticker price. Store unit_count in the snapshot row so the math is auditable.
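The per-unit arithmetic, using the $60 three-pack and $25 single unit from the example above, can be sketched like this. The mapping entries and competitor SKU codes are invented for illustration, and rounding half up to the nearest cent is an assumption.

```python
# (competitor, competitor_sku) -> (internal_sku, unit_count); entries are hypothetical.
SKU_MAP = {
    ("retail_b", "RB-3PK"): ("HERO-001", 3),
    ("retail_b", "RB-1"): ("HERO-001", 1),
}


def per_unit_cents(price_cents, unit_count):
    """Integer per-unit price, rounded half up, with no float arithmetic."""
    if unit_count < 1:
        raise ValueError("unit_count must be >= 1")
    return (2 * price_cents + unit_count) // (2 * unit_count)


def competitor_per_unit(competitor, competitor_sku, price_cents):
    """Resolve the mapping and return (internal_sku, price_per_unit_cents)."""
    internal_sku, unit_count = SKU_MAP[(competitor, competitor_sku)]
    return internal_sku, per_unit_cents(price_cents, unit_count)


print(competitor_per_unit("retail_b", "RB-3PK", 6000))
print(per_unit_cents(2500, 1))
```

An unmapped pair raises KeyError here on purpose, so a missing mapping surfaces as an error instead of a silent sticker-price comparison.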
Maintaining that mapping is ongoing work. Competitors relaunch SKUs, change pack sizes, rename products, and introduce limited-edition variants. Build a review queue for unmapped competitor SKUs rather than dropping them silently. An unmapped SKU is invisible to alerting — that blind spot is most dangerous during a competitor's promotional period when you most need the signal. A weekly report of 'competitor SKUs seen but not mapped' takes ten minutes to build and saves significant manual investigation later.
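The weekly "seen but not mapped" report amounts to a counted set difference. A minimal sketch, assuming sightings arrive as (competitor, competitor_sku) pairs and the mapping is a dict keyed the same way (both shapes are assumptions):

```python
from collections import Counter


def unmapped_report(seen_pairs, mapping):
    """Count sightings of competitor SKUs that have no mapping, most frequent first."""
    counts = Counter(pair for pair in seen_pairs if pair not in mapping)
    return counts.most_common()


seen = [("retail_b", "X1"), ("retail_b", "X9"), ("retail_b", "X9")]
mapping = {("retail_b", "X1"): ("HERO-001", 1)}
print(unmapped_report(seen, mapping))
```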
Subscription and loyalty pricing adds another layer of complexity. Some storefronts show a lower 'subscribe and save' price as the primary display price. Decide upfront whether your monitoring tracks the one-time purchase price, the subscription price, or both, and encode that decision in the selector strategy — not in an ad-hoc normalization patch applied later.
7. Operational and business metrics to track
Merchandising's trust depends on the false positive rate far more than on raw scrape volume. A system that fires ten alerts a day with three of them wrong gets muted within a week regardless of how much data it collects. Drive false positives down by fixing selectors and normalization before widening coverage. Selector health is the leading indicator: if the percentage of requests returning a non-empty price drops, a selector has broken, and you can catch it before any false alerts fire.
Margin impact is the metric that renews the budget. Tie repricing actions taken to the specific alerts that triggered them and you can show the program's dollar value rather than just its throughput. Even a rough attribution — 'this alert led to a price match that retained $X in sales' — is enough to justify the infrastructure cost to finance. Build that attribution into the dashboard from day one rather than retrofitting it when budget review arrives.
The js_rendering escalation rate is a cost-control signal. If it climbs above your modeled ratio, a competitor has likely added bot protection or changed their rendering approach. Investigate before the bill arrives rather than after.
- Alert latency — time from competitor price change to your team's notification
- False positive rate — alerts reversed on manual review within 24 hours
- SKU coverage — percentage of tier A and tier B catalog actively monitored
- Normalization failure rate — rows where price parsing returned null
- Selector health — percentage of requests returning non-empty price selectors
- Cost per monitored SKU per month — billing.charged summed by SKU tier
- Margin impact — revenue delta attributable to repricing actions triggered by alerts
- js_rendering escalation rate — percentage of requests that required browser rendering
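Three of the rates in this list fall straight out of the snapshot rows. The sketch below assumes each row carries a `null_reason` field ('selector_empty', 'parse_failed', or None) and a `scrape_mode` field; `null_reason` and the "fast" mode value are hypothetical names, and the normalization failure rate is counted here as parse failures only.

```python
def operational_rates(rows):
    """Compute selector health, normalization failure, and escalation rates."""
    total = len(rows)
    if total == 0:
        return None

    def share(predicate):
        return sum(1 for r in rows if predicate(r)) / total

    return {
        "selector_health": share(lambda r: r["null_reason"] != "selector_empty"),
        "normalization_failure_rate": share(lambda r: r["null_reason"] == "parse_failed"),
        "js_rendering_escalation_rate": share(lambda r: r["scrape_mode"] == "js_rendering"),
    }


rows = [
    {"null_reason": None, "scrape_mode": "fast"},
    {"null_reason": None, "scrape_mode": "js_rendering"},
    {"null_reason": "selector_empty", "scrape_mode": "fast"},
    {"null_reason": "parse_failed", "scrape_mode": "fast"},
]
print(operational_rates(rows))
```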
8. MAP monitoring and legal evidence collection
Minimum advertised price violations are tempting to automate but legally sensitive. The scraper's job ends at evidence collection, not accusation. Capture and archive the rendered HTML showing the advertised price — stored via data.content from the OmniScrape response — set a conservative map_violation_suspected flag, and route the case to a human before anything resembling an enforcement action goes out. The flag name includes 'suspected' deliberately: the system does not conclude, it surfaces.
Let counsel define the threshold rules that set the suspected flag, because what counts as 'advertised' versus a logged-in price, a cart-revealed price, or a third-party marketplace listing varies by jurisdiction and by your specific contracts with manufacturers. The system that wins here is the one that hands legal a clean evidence packet — timestamped URL, archived HTML, parsed price, proxy country — not the one that emails a competitor an automated cease-and-desist. Build the evidence packet format in collaboration with legal before you start collecting data, not after you need to use it.
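A sketch of what assembling that evidence packet might look like, with the suspected flag delegated to a rule supplied from outside the scraper. The packet field names, the `reviewed_by_human` marker, and the $29.99 floor in the example rule are all invented for illustration; the real rule and format would come from counsel.

```python
def build_evidence_packet(snapshot, url, proxy_country, suspected_rule):
    """Bundle the evidence fields; the rule decides the flag, not the scraper."""
    return {
        "url": url,
        "scraped_at": snapshot["scraped_at"],
        "archived_html_key": snapshot["snapshot_s3_key"],
        "price_cents": snapshot["price_cents"],
        "price_display": snapshot["price_display"],
        "proxy_country": proxy_country,
        "map_violation_suspected": bool(suspected_rule(snapshot)),
        # Nothing leaves the system until a person has looked at it.
        "reviewed_by_human": False,
    }


snapshot = {
    "price_cents": 2499,
    "price_display": "$24.99",
    "scraped_at": "2026-06-23T10:00:00Z",
    "snapshot_s3_key": "snapshots/retail_b/HERO-001/2026-06-23T10:00:00Z.html",
}


def example_rule(row):
    """Hypothetical counsel-configured rule: advertised price below a 2999-cent floor."""
    return row["price_cents"] < 2999


packet = build_evidence_packet(
    snapshot, "https://competitor.com/p/hero-001", "us", example_rule
)
print(packet["map_violation_suspected"], packet["reviewed_by_human"])
```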
Also track the advertised price separately from the checkout price where possible. Some retailers comply with MAP at the product display page level but apply automatic discounts at checkout. Whether that constitutes a violation is a legal question; your scraper's job is to capture both numbers so the question can be answered.
9. Polling frequency, cost modeling, and block risk
Frequency is the lever that trades cost and block risk against latency, and it compounds fast. Polling 500 hero SKUs every two hours over thirty days is roughly 180,000 requests a month before you add tier B or account for retries. Model the fast-to-js_rendering ratio from a two-week pilot on your actual target storefronts before promising sub-hour alerts to stakeholders. A storefront that escalates to js_rendering on every fetch can quietly multiply the bill by an order of magnitude compared to your pilot estimate on a different site.
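The volume figure above is simple arithmetic (500 SKUs × 12 polls per day × 30 days), and the escalation effect can be modeled the same way. In the sketch below the credit costs of 1 per fast request and 10 per js_rendering request are placeholders, not published pricing; substitute the ratio and costs observed in your pilot.

```python
def monthly_requests(sku_count, polls_per_day, days=30, competitors=1):
    """Requests per month before retries."""
    return sku_count * polls_per_day * days * competitors


def blended_credits(requests, js_ratio, fast_credits, js_credits):
    """Total credits when a share of requests escalates to browser rendering."""
    js_requests = round(requests * js_ratio)
    return (requests - js_requests) * fast_credits + js_requests * js_credits


tier_a = monthly_requests(sku_count=500, polls_per_day=12)
print(tier_a)
print(blended_credits(tier_a, js_ratio=0.1, fast_credits=1, js_credits=10))
print(blended_credits(tier_a, js_ratio=1.0, fast_credits=1, js_credits=10))
```

With these placeholder costs, a storefront that escalates on every fetch costs ten times the all-fast baseline, which is the order-of-magnitude effect the paragraph warns about.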
High frequency also raises the block rate, so polling harder is not free even when the budget allows it. Rate limiters and bot detection systems are tuned to detect polling patterns, not just individual requests. If you genuinely need sub-hour latency, restrict it to a small set of the most volatile SKUs (typically ten to twenty products) and accept daily cadence elsewhere. Rotating proxies help spread high-frequency load across IPs without tripping per-IP rate limits.
Build a frequency review into the quarterly planning cycle. As the catalog grows and tier assignments shift, the total request volume can drift significantly from the original model. A quarterly review of billing.charged by tier against the commercial value of those SKUs keeps the program economically defensible and gives you data to argue for budget increases when the tier A list expands.
10. Phased rollout: earning trust before scaling
Start narrow and earn trust before scaling. Week one runs 50 hero SKUs against two or three key competitors, with a human reviewing every alert. This surfaces selector bugs, normalization edge cases, and false positive patterns while the blast radius is small. Document every false positive and its root cause — that log becomes the quality checklist that governs expansion.
Week two automates the threshold rules once the false positive rate drops below an agreed target, typically under 5% of alerts reviewed. Week three expands into the tier B catalog with the alerting logic already proven on tier A. Week four introduces the MAP suspected flag for the first time, with legal reviewing the initial batch before the flag is trusted to route autonomously.
Resist the urge to launch the full catalog on day one. A flood of unreviewed alerts, half of them false, will burn the merchandising team's patience before the system has a chance to prove its value. A small, accurate program that grows incrementally is far more durable than a large, noisy one that gets ignored. The rollout phases also give you natural checkpoints to validate cost assumptions before committing to the full polling budget.
Frequently asked questions
How fast can price monitoring realistically run?
Technically you can poll a SKU hourly or faster, but cost and block rate both scale with frequency, so unbounded polling is rarely justified. In practice, a two-hour cadence on tier A hero SKUs covers the vast majority of competitive pricing events in most categories. Sub-hour monitoring makes sense only for a very small set of highly volatile SKUs — flash-sale electronics, for example — and should be modeled against the actual frequency of price changes observed in your pilot data, not assumed.
Why did a price alert fire incorrectly?
The two most common culprits are a selector that grabbed the strikethrough was-price instead of the current price, and a currency or locale mismatch in normalization. The archived HTML from the alert timestamp — stored from data.content in the OmniScrape response — tells you which one it was. A third culprit is a bundle or multi-pack URL being compared against a single-unit price without the unit_count adjustment. Check the snapshot row's price_display string against the archived HTML first; if they match, the bug is in normalization; if they do not, the selector broke.
Is a daily polling cadence fast enough for most categories?
For most non-promotional catalog SKUs, yes. Electronics and durable goods often move slowly enough that a daily check suffices, while flash-sale retail can shift within minutes and may justify a two-hour tier on a small SKU set. The right answer is to measure observed price change frequency on your specific competitors during a two-week pilot, then set cadence to be roughly twice as fast as the median change interval for each tier. Applying one frequency everywhere is almost always either wasteful or insufficient.
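The "twice as fast as the median change interval" rule can be computed directly from pilot data. A sketch, assuming you have the timestamps at which a price change was observed for one SKU-competitor pair (the function name is illustrative):

```python
from datetime import datetime, timedelta
from statistics import median


def recommended_interval(change_timestamps):
    """Half the median gap between observed price changes, or None if unknown."""
    ordered = sorted(change_timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return None
    return timedelta(seconds=median(gaps) / 2)


start = datetime(2026, 6, 1)
changes = [start + timedelta(hours=h) for h in (0, 8, 16, 24)]
print(recommended_interval(changes))
```

Note that the pilot's own polling cadence caps how short an observed gap can be, so run the pilot faster than the cadence you expect to need.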
Should I use css_extractor output format or fetch full HTML and parse locally?
Use css_extractor as the default — it is faster, simpler, and returns only the fields you need without transferring full page HTML. Switch to fetching full HTML (output_format: 'html') and parsing locally when the site embeds the authoritative price in JSON-LD structured data alongside misleading strikethrough markup in the visible DOM, or when you need to extract data from multiple locations that would require a large number of selectors. JSON-LD prices are often more reliable than rendered DOM prices because they are machine-readable and less likely to include promotional display artifacts.
How do I handle storefronts that require login to show prices?
Use session_id in the OmniScrape request to maintain a logged-in session across polls. Authenticate once, capture the session cookie, and pass the session_id on subsequent requests. Be aware that session-gated prices may not constitute 'advertised' prices under MAP agreements — confirm with counsel before using login-revealed prices as the basis for MAP violation flags. Also rotate sessions periodically, as long-lived sessions on high-frequency polling are more likely to trigger bot detection.
How do I evaluate the cost of OmniScrape against my current scraping setup?
Run a two-week shadow test on your tier A SKU list, sending the same URLs through both approaches and comparing the total cost per successfully normalized price extract. Cost per request is a misleading metric because it ignores failed responses, empty selectors, and incorrectly parsed prices that require manual correction. Cost per good extract — a successful response where price_cents is non-null and passes normalization — is the number that reflects real operational value. Log billing.charged from each OmniScrape response alongside your own infrastructure cost to make the comparison concrete.
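Cost per good extract is straightforward to compute once each request is logged as a row. The sketch below assumes rows carry `charged` (from billing.charged), `success`, and `price_cents`; the flat row shape and the sample values are assumptions for illustration.

```python
def cost_per_good_extract(rows):
    """Total charged divided by the number of usable price extractions."""
    total_charged = sum(r["charged"] for r in rows)
    good = sum(1 for r in rows if r["success"] and r["price_cents"] is not None)
    # No good extracts means the metric is undefined, not zero.
    return None if good == 0 else total_charged / good


shadow_rows = [
    {"charged": 1, "success": True, "price_cents": 2499},
    {"charged": 5, "success": True, "price_cents": 2599},
    {"charged": 1, "success": True, "price_cents": None},
    {"charged": 1, "success": False, "price_cents": None},
]
print(cost_per_good_extract(shadow_rows))
```

Here the naive cost per request would be 2.0, while the cost per good extract is 4.0, because half the charged requests produced nothing usable.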
What should I do when a competitor changes their price markup and selectors break?
A spike in normalization_failure_rate or selector_empty responses is your early warning. When it appears, pull the archived HTML for the affected competitor and diff it against the previous known-good snapshot to identify the markup change. Update the selector, deploy it, and backfill the gap period with a manual check of the archived HTML to determine whether any real price changes occurred during the outage. Document the incident in your selector change log — patterns in how competitors update their markup often repeat, and the log helps you anticipate future breaks.