1. What teams extract from Amazon PDPs
Start from the business question before writing a single selector. Price monitors need the buy box winner, list price, deal badge, coupon text, and fulfillment type. Catalog teams want title, brand, bullet points, category breadcrumb, main image URL, and variant ASIN relationships. Review analysts pull the star histogram, total review count, and individual review text with verified-purchase flags and reviewer metadata.
Some fields are straightforward — title and brand rarely move. Others are volatile: buy box price can shift every few minutes, BSR updates hourly, and availability text is locale-dependent. Design your schema around the fields your use case actually needs, and instrument alerts when high-value fields come back empty rather than silently storing nulls.
- ASIN and parent ASIN (for variant families — color, size, style)
- Buy box price, currency symbol, and buy box seller name (Amazon Retail vs FBA vs FBM)
- List price (struck-through reference price), savings amount, savings percentage
- Deal badges: Lightning Deal countdown, Prime Exclusive Discount, coupon clip amount
- Title, brand, bullet point features (up to 5), and long-form product description
- Star rating (aggregate), review count, and rating breakdown histogram (5★ through 1★)
- Best Sellers Rank (BSR) per category node — products can rank in multiple nodes
- Availability text, fastest delivery promise, and fulfillment type (Prime, FBA, FBM)
- A+ content presence flag, brand story section, and embedded video count
- Main image URL and alternate image gallery URLs
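The advice above about alerting on empty high-value fields can be sketched as a small schema plus a check. This is a minimal illustration, not a prescribed data model: the `PdpRecord` class, the `REQUIRED_FOR_PRICE_MONITOR` tuple, and the `missing_required` helper are all hypothetical names, and the choice of required fields is an assumption you should adapt to your own use case.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class PdpRecord:
    """One scraped PDP snapshot; every field except the ASIN may be absent."""
    asin: str
    title: Optional[str] = None
    brand: Optional[str] = None
    buy_box_price: Optional[str] = None
    list_price: Optional[str] = None
    seller: Optional[str] = None
    availability: Optional[str] = None
    rating: Optional[str] = None
    review_count: Optional[str] = None


# Assumption: fields a price monitor cannot do without. Adjust per use case.
REQUIRED_FOR_PRICE_MONITOR = ("title", "buy_box_price", "availability")


def missing_required(record: PdpRecord,
                     required: Sequence[str] = REQUIRED_FOR_PRICE_MONITOR) -> List[str]:
    """Return the names of required fields that are None or blank."""
    missing = []
    for name in required:
        value = getattr(record, name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

A non-empty result from `missing_required` is the point at which to raise an alert instead of writing the row with nulls.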
2. Amazon URL patterns that survive redesigns
Amazon's search and autocomplete endpoints change frequently and trigger bot scoring the fastest, so avoid them for bulk data collection. The most stable entry points are PDP URLs keyed by ASIN. Amazon resolves any slug variation to the same product page, so you can always use the bare /dp/ASIN form and ignore the human-readable slug entirely.
Build your ASIN list from brand feeds, licensed catalog data, or low-volume category browse — then refresh individual PDPs on a schedule. This keeps your scraping surface predictable and avoids the high-detection-risk search surface. For the reviews endpoint, the ref parameter is optional; the ASIN is what matters.
- PDP (bare): https://www.amazon.com/dp/B08N5WRWNW
- PDP (with slug, resolves to the same page): https://www.amazon.com/Echo-Dot-4th-Gen/dp/B08N5WRWNW
- Reviews page: https://www.amazon.com/product-reviews/B08N5WRWNW
- Reviews paginated: https://www.amazon.com/product-reviews/B08N5WRWNW?pageNumber=2
- All offers / seller listing: https://www.amazon.com/gp/offer-listing/B08N5WRWNW
- Category search (higher detection risk): https://www.amazon.com/s?k=wireless+earbuds&page=2
- Marketplace TLD variants: amazon.co.uk, amazon.de, amazon.fr, amazon.co.jp, amazon.ca — each is a fully separate catalog with independent pricing and seller pools
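The URL patterns listed above can be normalized with a few helpers. This is a sketch that assumes ASINs are 10 uppercase alphanumeric characters; `extract_asin`, `pdp_url`, and `reviews_url` are hypothetical helper names, not part of any library.

```python
import re
from typing import Optional

# Matches the ASIN segment in the PDP, reviews, and offer-listing URL forms.
ASIN_RE = re.compile(
    r"/(?:dp|product-reviews|gp/offer-listing)/([A-Z0-9]{10})(?:[/?#]|$)"
)


def extract_asin(url: str) -> Optional[str]:
    """Pull the ASIN out of a URL, ignoring any human-readable slug."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None


def pdp_url(asin: str, domain: str = "www.amazon.com") -> str:
    """Build the bare /dp/ASIN form for a given marketplace domain."""
    return f"https://{domain}/dp/{asin}"


def reviews_url(asin: str, page: int = 1, domain: str = "www.amazon.com") -> str:
    """Build the reviews URL; page 1 needs no pageNumber parameter."""
    base = f"https://{domain}/product-reviews/{asin}"
    return base if page <= 1 else f"{base}?pageNumber={page}"
```

Normalizing every input URL through `extract_asin` and `pdp_url` before queueing keeps the crawl keyed on the ASIN rather than on whatever slug the source feed happened to contain.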
3. Where the data lives in Amazon HTML
Amazon embeds structured data when it helps their own SEO. Always check application/ld+json script blocks for Product schema first — name, image, offers.price, and aggregateRating are frequently present even when the visible DOM is heavily obfuscated or A/B-tested. This makes JSON-LD a reliable cross-check for price selectors.
The buy box price typically renders inside #corePrice_feature_div. Amazon duplicates price text for screen-reader accessibility: the visible formatted price uses span.a-price-whole and span.a-price-fraction, while span.a-price .a-offscreen holds the complete price as a single string including the currency symbol (e.g., "$29.99"). Always target .a-offscreen for machine parsing, and strip the currency symbol in your own code. The list price (struck-through) sits in .basisPrice .a-offscreen or .a-text-price .a-offscreen depending on the category template.
Review histogram bars are anchored by #histogramTable or by data-hook attributes on the reviews page. BSR appears in table rows inside the product details section — the exact wrapper varies by category template (electronics uses a different detail table structure than books or grocery). When in doubt, search for the literal text 'Best Sellers Rank' in the HTML and walk up to the containing row.
Variant selectors (color swatches, size tiles) are driven by JavaScript and inline JSON embedded in a script tag containing 'twister-plus-js-init-data' or similar. If you need variant ASIN mapping, request mode js_rendering and parse that JSON blob rather than trying to click through swatches.
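Once the price and Best Sellers Rank text have been extracted, they still need to be turned into numbers. The sketch below is a heuristic under stated assumptions: `parse_price` treats a final separator followed by exactly two digits as the decimal point (which misreads currencies without minor units if they are ever shown with two trailing digits), and `parse_bsr` assumes the English "#1,234 in Category" wording. Both function names are hypothetical.

```python
import re
from decimal import Decimal
from typing import List, Optional, Tuple


def parse_price(raw: str) -> Optional[Decimal]:
    """Convert a price string such as "$29.99" or "1.299,00 €" to a Decimal."""
    digits = re.sub(r"[^\d.,]", "", raw).rstrip(".,")
    if not re.search(r"\d", digits):
        return None
    last_sep = max(digits.rfind("."), digits.rfind(","))
    # Heuristic: a separator followed by exactly two digits is the decimal point.
    if last_sep != -1 and len(digits) - last_sep - 1 == 2:
        whole = re.sub(r"[.,]", "", digits[:last_sep]) or "0"
        return Decimal(f"{whole}.{digits[last_sep + 1:]}")
    # Otherwise every separator is a thousands separator.
    return Decimal(re.sub(r"[.,]", "", digits))


# Assumed English wording: "#1,234 in Electronics (See Top 100 in Electronics)".
BSR_RE = re.compile(r"#([\d,]+)\s+in\s+([^(#]+)")


def parse_bsr(text: str) -> List[Tuple[int, str]]:
    """Return (rank, category) pairs found in a Best Sellers Rank cell."""
    return [
        (int(rank.replace(",", "")), category.strip())
        for rank, category in BSR_RE.findall(text)
    ]
```

Keeping the raw string alongside the parsed value makes it possible to re-parse later if the heuristic turns out to be wrong for a given marketplace.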
4. How Amazon blocks scrapers
Amazon does not publicly name its bot management vendor, but the observed behavior matches enterprise-grade fingerprinting: TLS/JA3 fingerprint analysis, HTTP/2 settings inspection, behavioral scoring across request sequences, and CAPTCHA challenges on search and high-velocity IPs. Critically, Amazon frequently returns HTTP 200 with a 'dog page' (the cartoon dog error) or a stripped buy box instead of a hard block — your pipeline must validate that actual product content is present, not just that the status code was 200.
Regional storefronts serve different HTML structures and pricing. Scraping amazon.com through a German IP, or sending Accept-Language headers that do not match the marketplace TLD, produces wrong or incomplete data: Amazon may serve a redirect, a localized interstitial, or simply omit the buy box. Dynamic pricing and seller rotation for some categories load via AJAX after first paint, meaning a pure HTTP fetch can return stale or absent buy box data. Use mode auto and verify with js_rendering when prices come back empty.
- CAPTCHA interstitials ('Enter the characters you see below') on search and high-frequency PDP access
- 'Sorry, we just need to make sure you're not a robot' pages that return HTTP 200
- IP reputation scoring — datacenter IP ranges fail first; residential proxies matched to marketplace country are required
- Zip-code and delivery-context-dependent pricing on grocery, pantry, and some electronics categories
- A/B layout tests that relocate #corePrice_feature_div or replace it with new wrapper IDs without notice
- Login prompts on review pagination at scale — Amazon gates deep review pagination behind account sessions
- Session-linked URL tokens in some search and offer-listing URLs that expire within minutes
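Because several of the blocks above arrive with HTTP 200, a content check has to sit between the fetch and the parser. The following is a minimal sketch that looks for the challenge phrases quoted in this section and for two landmark element IDs; the `classify_response` name and its return labels are hypothetical, and real pages may encode the apostrophe or the attributes differently, so treat the marker strings as a starting point.

```python
# Challenge phrases quoted in this guide; real markup may vary.
BLOCK_MARKERS = (
    "enter the characters you see below",
    "we just need to make sure you're not a robot",
)


def classify_response(status: int, html: str) -> str:
    """Label a fetched PDP so soft blocks are not stored as product data."""
    if status != 200:
        return "http_error"
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return "captcha"
    # No product title: error page, interstitial, or some other non-PDP response.
    if 'id="producttitle"' not in lowered:
        return "missing_content"
    # Title present but no price container: stripped buy box or layout test.
    if 'id="coreprice_feature_div"' not in lowered:
        return "stripped_buy_box"
    return "ok"
```

Only responses labeled `ok` should reach the extraction step; the other labels are worth counting separately, since a rising `stripped_buy_box` rate can mean either a soft block or a layout change.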
5. Scrape a product detail page with OmniScrape
Send the PDP URL to POST https://api.omniscrape.io/v1/scrape with mode auto and a US residential proxy so Amazon serves the same buy box a shopper in that region sees. Setting output_format to css_extractor lets OmniScrape evaluate your selectors server-side and return only the extracted values — no HTML parsing in your application code. Cross-check the buy box price against JSON-LD if buy_box_price comes back empty; that combination catches both selector drift and soft blocks.
The response will contain body.data.css_extracted with your named fields. Check body.success and inspect body.metadata.method_used to understand whether OmniScrape escalated to a full browser render. If method_used is 'fast' and buy_box_price is empty, retry with mode js_rendering and js_wait_selector set to #corePrice_feature_div.
{
  "url": "https://www.amazon.com/dp/B08N5WRWNW",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "css_selectors": {
    "title": "#productTitle",
    "buy_box_price": "#corePrice_feature_div .a-price .a-offscreen",
    "list_price": ".basisPrice .a-offscreen",
    "deal_badge": "#dealBadge_feature_div .a-badge-label",
    "rating": "#acrPopover span.a-size-base",
    "review_count": "#acrCustomerReviewText",
    "bsr": "#productDetails_detailBullets_sections1 tr:has(th:contains('Best Sellers Rank')) td",
    "availability": "#availability span",
    "seller": "#sellerProfileTriggerId",
    "brand": "#bylineInfo",
    "main_image": "#landingImage"
  }
}
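The request and the retry rule described in this section might be wired together as follows. This is a sketch under assumptions, not OmniScrape's documented client: it assumes the parsed JSON response is the object this guide calls body, that js_wait_selector is a top-level request field, and that the caller supplies whatever authentication headers the account requires. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.omniscrape.io/v1/scrape"


def build_payload(asin: str, mode: str = "auto",
                  wait_selector: Optional[str] = None) -> dict:
    """Build a css_extractor request for one PDP (trimmed selector set)."""
    payload = {
        "url": f"https://www.amazon.com/dp/{asin}",
        "mode": mode,
        "output_format": "css_extractor",
        "proxy": "residential:us",
        "css_selectors": {
            "title": "#productTitle",
            "buy_box_price": "#corePrice_feature_div .a-price .a-offscreen",
        },
    }
    if wait_selector:
        # Assumption: js_wait_selector sits at the top level of the request.
        payload["js_wait_selector"] = wait_selector
    return payload


def post_scrape(payload: dict, auth_headers: dict) -> dict:
    """POST the payload; auth_headers come from the caller's account setup."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", **auth_headers},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def needs_render_retry(body: dict) -> bool:
    """True when a 'fast' fetch came back without a buy box price."""
    extracted = (body.get("data") or {}).get("css_extracted") or {}
    method = (body.get("metadata") or {}).get("method_used")
    return method == "fast" and not extracted.get("buy_box_price")
```

When `needs_render_retry` returns true, the follow-up call would be `build_payload(asin, mode="js_rendering", wait_selector="#corePrice_feature_div")`.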
6. Pull review histogram and individual review text
The star histogram is available on the PDP itself, but individual review text requires the dedicated reviews endpoint. Paginate with ?pageNumber=N — Amazon typically shows 10 reviews per page. Keep concurrency low and introduce per-request delays; Amazon ties review scraping detection to both IP reputation and request cadence. Do not parallelize review pagination aggressively.
The data-hook attributes on the reviews page are more stable than class-based selectors — Amazon has kept data-hook='review-title', data-hook='review-body', and data-hook='avp-badge' consistent across redesigns. The histogram percentage bars on the reviews page use aria-label attributes that include the percentage as text, which is more reliable than trying to measure bar width.
{
  "url": "https://www.amazon.com/product-reviews/B08N5WRWNW",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "css_selectors": {
    "overall_rating": "[data-hook=rating-out-of-text]",
    "total_reviews": "[data-hook=total-review-count]",
    "histogram_5star": "[data-hook=histogram-row-5-star] [aria-label]",
    "histogram_4star": "[data-hook=histogram-row-4-star] [aria-label]",
    "histogram_3star": "[data-hook=histogram-row-3-star] [aria-label]",
    "histogram_2star": "[data-hook=histogram-row-2-star] [aria-label]",
    "histogram_1star": "[data-hook=histogram-row-1-star] [aria-label]",
    "review_titles": "[data-hook=review-title]",
    "review_bodies": "[data-hook=review-body] span",
    "review_dates": "[data-hook=review-date]",
    "verified_badges": "[data-hook=avp-badge]",
    "reviewer_names": ".a-profile-name"
  }
}
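The pacing advice for review pagination, and the aria-label approach to the histogram, can be sketched as below. The exact aria-label wording is an assumption (the regex accepts either "70%" or "70 percent"), `fetch_page` stands in for whatever function performs the actual scrape, and the five-second default delay is illustrative rather than a measured threshold.

```python
import re
import time
from typing import Callable, List, Optional


def histogram_percent(aria_label: Optional[str]) -> Optional[int]:
    """Read the percentage out of a histogram bar's aria-label text."""
    match = re.search(r"(\d{1,3})\s*(?:%|percent)", aria_label or "")
    return int(match.group(1)) if match else None


def crawl_review_pages(asin: str,
                       fetch_page: Callable[[str], list],
                       max_pages: int = 5,
                       delay_seconds: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[list]:
    """Fetch review pages one at a time, pausing between requests.

    Stops early when a page comes back empty, which may mean either the
    end of the reviews or a block; the caller decides which.
    """
    pages = []
    for page in range(1, max_pages + 1):
        url = f"https://www.amazon.com/product-reviews/{asin}?pageNumber={page}"
        reviews = fetch_page(url)
        if not reviews:
            break
        pages.append(reviews)
        if page < max_pages:
            sleep(delay_seconds)
    return pages
```

The loop is deliberately sequential: one request per ASIN at a time, with the pause injected as a parameter so it can be tuned or replaced in tests.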
7. Fallback: parse JSON-LD when CSS selectors break
When Amazon ships an A/B layout test that moves or renames price divs, the JSON-LD Product schema embedded for Google Shopping often still validates correctly. Request output_format html, locate all script[type="application/ld+json"] blocks in body.data.content, filter for @type === 'Product', and parse offers.price, offers.priceCurrency, offers.availability, and aggregateRating.ratingValue from the structured data.
This approach is slower than css_extractor because you receive and parse the full HTML, but it survives layout churn far better. The recommended production pattern is to keep both paths: use css_extractor as the primary path for speed, and fall back to JSON-LD parsing when css_extracted.buy_box_price is empty or null. An empty css_extractor result with a populated JSON-LD price is a reliable signal that a selector needs updating, so log it as a selector drift alert rather than a data gap.
For the offers page (/gp/offer-listing/ASIN), JSON-LD is less useful because it only reflects the buy box winner. You need to parse the HTML table rows to get all competing seller prices, conditions, and shipping costs.
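The JSON-LD fallback for PDPs can be sketched with the standard library alone. This example assumes the HTML has already been pulled from body.data.content, and that the Product object appears either at the top level of a script block or as an item in a top-level array; nested @graph structures are not handled. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def _is_product(item: dict) -> bool:
    kind = item.get("@type")
    return kind == "Product" or (isinstance(kind, list) and "Product" in kind)


def extract_product_jsonld(html: str) -> Optional[dict]:
    """Return price, currency, availability, and rating from Product JSON-LD."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Malformed block; try the next one.
        for item in (doc if isinstance(doc, list) else [doc]):
            if not (isinstance(item, dict) and _is_product(item)):
                continue
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            rating = item.get("aggregateRating") or {}
            return {
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
                "availability": offers.get("availability"),
                "rating": rating.get("ratingValue"),
            }
    return None
```

A `None` result means no Product block was found, which is itself worth logging: combined with an empty css_extractor result it points toward a soft block rather than selector drift.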
8. Multi-marketplace scraping across Amazon TLDs
Each Amazon TLD is a fully independent catalog. The same ASIN may be listed on amazon.com, amazon.co.uk, amazon.de, and amazon.co.jp with different prices, different sellers, different review counts, and different availability — treat them as separate products in your data model. Always match the proxy country to the marketplace TLD: use residential:de for amazon.de, residential:gb for amazon.co.uk, residential:jp for amazon.co.jp. Mismatched proxies produce incorrect pricing, missing buy boxes, and sometimes marketplace redirects.
Currency and VAT display rules differ by marketplace. German and French storefronts display VAT-inclusive prices by default; UK prices include VAT for consumer-facing listings. Store the raw price string and currency code from each marketplace independently, and normalize to a base currency in your ETL layer — do not rely on Amazon to expose USD equivalents on non-US storefronts.
For price monitoring across regions, include marketplace as a first-class dimension on every row in your data store. Index on (asin, marketplace, scraped_at) to support time-series queries and cross-market price differential analysis.
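One way to make marketplace a first-class dimension is shown below, using SQLite purely for illustration. The proxy strings come from this section; the table name, column names, and helper functions are assumptions, and a production store would likely be a different database with the same composite index.

```python
import sqlite3

# Marketplace TLD mapped to the matching proxy country.
MARKETPLACE_PROXY = {
    "amazon.com": "residential:us",
    "amazon.co.uk": "residential:gb",
    "amazon.de": "residential:de",
    "amazon.co.jp": "residential:jp",
}

SCHEMA = """
CREATE TABLE IF NOT EXISTS price_observations (
    asin        TEXT NOT NULL,
    marketplace TEXT NOT NULL,
    scraped_at  TEXT NOT NULL,   -- ISO 8601 UTC timestamp
    raw_price   TEXT,            -- price string exactly as scraped
    currency    TEXT             -- currency code for this marketplace
);
CREATE INDEX IF NOT EXISTS idx_asin_marketplace_time
    ON price_observations (asin, marketplace, scraped_at);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the observation store and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_price(conn, asin, marketplace, scraped_at, raw_price, currency):
    """Insert one observation, rejecting marketplaces with no proxy mapping."""
    if marketplace not in MARKETPLACE_PROXY:
        raise ValueError(f"unknown marketplace: {marketplace}")
    conn.execute(
        "INSERT INTO price_observations VALUES (?, ?, ?, ?, ?)",
        (asin, marketplace, scraped_at, raw_price, currency),
    )
    conn.commit()
```

Storing the raw string and currency per row leaves currency normalization to the ETL layer, and the composite index serves both per-market time series and cross-market comparisons for a single ASIN.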
9. Amazon Terms of Service and legal considerations
Amazon's Conditions of Use explicitly restrict automated access to their platform without prior written permission. The legal landscape for web scraping public product data continues to evolve — hiQ v. LinkedIn established that scraping publicly accessible data is not automatically a CFAA violation, but it does not provide blanket authorization for commercial scraping of any platform, and Amazon has pursued legal action against scrapers independently of CFAA arguments.
Many teams operate MAP monitoring programs under contractual relationships with brands or Amazon's own Brand Registry tooling — this is a different legal posture than unilateral competitive scraping. Before deploying a production Amazon scraper at scale, confirm the use case with legal counsel. Scope your data collection to public product and review text on PDPs; do not collect buyer names, shipping addresses, order histories, or any account-linked identities. OmniScrape provides the technical infrastructure for making HTTP requests; determining whether a specific use case is legally permissible in your jurisdiction is your responsibility.
Frequently asked questions
Should I scrape Amazon search results or go directly to PDP URLs?
Go directly to PDP URLs keyed by ASIN whenever possible. Search result pages trigger CAPTCHA and bot scoring much faster than PDPs, encode session state in URLs that expire, and have less stable HTML structure. Build your ASIN list from brand feeds, licensed catalog data, or low-volume category browse — then refresh individual PDPs on a schedule. This keeps detection risk low and your data model clean.
Why is the buy box price empty in my scrape?
There are four common causes: (1) selector drift from an A/B layout test — the price moved out of #corePrice_feature_div; (2) the price loads via JavaScript after first paint and a fast HTTP fetch returned the pre-render HTML; (3) the proxy country does not match the marketplace TLD, causing Amazon to suppress the buy box; (4) a soft block returned a dog page with HTTP 200. Diagnosis: check body.metadata.method_used — if it's 'fast', retry with mode js_rendering and js_wait_selector set to '#corePrice_feature_div'. Also parse JSON-LD from the same response; if JSON-LD has a price but css_extracted does not, you have selector drift.
Does OmniScrape solve Amazon CAPTCHAs automatically?
Yes: mode auto escalates to a full headless browser session when Amazon serves a CAPTCHA or challenge page, and OmniScrape's Web Unlocker handles the solve. Success rate is high for PDP URLs with residential proxies but is not guaranteed at extreme concurrency on search endpoints. Keep per-IP request rates modest, prefer PDP URLs over search, and stagger requests rather than bursting. See the guide "web scraping without getting blocked" for rate discipline patterns.
How do I track all sellers on a single ASIN, not just the buy box winner?
Scrape the offer-listing URL (/gp/offer-listing/ASIN) which lists all competing offers in a table. Parse each row for seller name, price, condition (New/Used), fulfillment type (FBA/FBM), and shipping cost. The buy box winner is marked separately. You can also monitor buy box rotation on the PDP by polling #sellerProfileTriggerId and alerting when the seller name changes — this is lighter weight than parsing the full offers page on every cycle.
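The lighter-weight polling approach can be reduced to a comparison against the last seller seen per ASIN. This is a sketch with a hypothetical `detect_seller_change` helper and an in-memory dictionary as state; it treats an empty seller value as "no observation" so that a failed scrape does not register as a rotation.

```python
from typing import Dict, Optional, Tuple


def detect_seller_change(last_seen: Dict[str, str],
                         asin: str,
                         seller: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (previous, current) when the buy box seller changed, else None."""
    current = (seller or "").strip()
    if not current:
        return None  # Empty scrape: keep the previous state untouched.
    previous = last_seen.get(asin)
    last_seen[asin] = current
    if previous is not None and previous != current:
        return previous, current
    return None
```

A returned tuple is the trigger for an alert, and also a reasonable moment to fetch the full offers page for that ASIN.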
How do I scrape Amazon review pagination without getting blocked?
Use the /product-reviews/ASIN endpoint with ?pageNumber=N. Keep concurrency at 1 request per ASIN at a time, add a delay of several seconds between page requests, and use residential proxies. Amazon gates deep pagination (beyond page 5–10) more aggressively than early pages. If you hit a login prompt, that session or IP has been flagged — rotate proxy and resume from the last successful page. Avoid scraping reviews for hundreds of ASINs simultaneously from the same IP pool.
Can I scrape Amazon product data for price comparison or MAP monitoring?
Technically yes, but the legal permissibility depends on your use case and jurisdiction. MAP monitoring on behalf of brands you represent or have contracts with is a common and generally lower-risk use case. Building a public price comparison engine that republishes Amazon data at scale is higher risk — Amazon's ToS prohibit this and they actively enforce it. Confirm your specific use case with legal counsel before deploying at production scale.
How do I handle Amazon's different category templates in my selectors?
The safest approach is to maintain a selector map per category type (electronics, books, grocery, apparel) and detect which template you received by checking for landmark elements. Alternatively, use the JSON-LD fallback as your primary extraction path for fields like price and rating — it's template-agnostic. For fields that only exist in the DOM (BSR, availability, seller name), write defensive selectors that try multiple candidates in order and log which one matched, so you can track template drift over time.
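Trying multiple candidates in order can be done by sending each candidate as its own named field and picking the first non-empty value afterwards. The sketch below uses the two list price selectors mentioned earlier in this guide; the field names, the `first_match` helper, and the assumption that extracted values come back as strings are all illustrative.

```python
import logging
from typing import Dict, Optional, Sequence, Tuple

logger = logging.getLogger("selector_drift")

# Ordered candidates: (field name sent in css_selectors, selector).
LIST_PRICE_CANDIDATES = (
    ("list_price_basis", ".basisPrice .a-offscreen"),
    ("list_price_text", ".a-text-price .a-offscreen"),
)


def candidate_selectors(candidates: Sequence[Tuple[str, str]]) -> Dict[str, str]:
    """Expand candidates into entries for the css_selectors request field."""
    return dict(candidates)


def first_match(extracted: Dict[str, object],
                candidates: Sequence[Tuple[str, str]]
                ) -> Tuple[Optional[str], Optional[str]]:
    """Return (field name, value) for the first candidate with a value."""
    for name, _selector in candidates:
        value = extracted.get(name)
        if isinstance(value, str) and value.strip():
            logger.info("selector matched: %s", name)
            return name, value.strip()
    logger.warning("no candidate matched: %s", [name for name, _ in candidates])
    return None, None
```

Counting how often each candidate wins, per category, gives an early view of template drift before the primary selector stops matching altogether.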