1. DTC catalog fields from Shopify
Competitive catalog monitoring typically needs variant-level granularity — not just the product title, but the specific SKU, option combination (size + color), current price, and compare-at price that signals a markdown. Inventory signals are valuable even when exact counts are hidden; a sold-out variant tells you something about demand.
The fields below are available from /products.json on stores that expose it, and from PDP HTML on stores that do not. The JSON API is authoritative for structured fields like created_at and tags; HTML scraping is required for fields rendered client-side.
- Product handle, title, vendor, and product_type
- Variant ID, SKU, price (a decimal string such as "49.00" in /products.json), compare_at_price
- Option names and values per variant (e.g., Size: M, Color: Navy)
- available boolean and inventory_quantity when exposed in JSON
- Product tags and collection membership for category inference
- Primary and gallery image URLs with alt text
- Product description as raw HTML (body_html in JSON)
- created_at and updated_at timestamps from the JSON API
- Metafields if exposed via theme liquid (rarely in JSON, sometimes in HTML)
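As a rough illustration of how the fields above fit together, the sketch below flattens one product object from /products.json into one row per variant. It assumes the common payload shape (an options list of named options on the product, and option1 to option3 values on each variant); the function name flatten_variants and the row layout are hypothetical, and prices are passed through exactly as returned.

```python
from typing import Any, Dict, Iterator


def flatten_variants(product: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per variant of a products.json product object."""
    # Option names ("Size", "Color") live on the product; the values live on
    # each variant as option1..option3, in the same order (assumed shape).
    option_names = [opt.get("name") for opt in product.get("options", [])]
    for variant in product.get("variants", []):
        values = [variant.get(f"option{i}") for i in (1, 2, 3)]
        options = {
            name: value
            for name, value in zip(option_names, values)
            if name is not None and value is not None
        }
        yield {
            "handle": product.get("handle"),
            "title": product.get("title"),
            "vendor": product.get("vendor"),
            "variant_id": variant.get("id"),
            "sku": variant.get("sku"),
            "price": variant.get("price"),
            "compare_at_price": variant.get("compare_at_price"),
            "available": variant.get("available"),
            "options": options,
        }
```

Keeping one row per variant makes it straightforward to diff prices and availability between monitoring runs.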
2. Shopify URL patterns
Every Shopify storefront — whether on a myshopify.com subdomain or a custom domain — follows the same routing conventions. This means discovery logic written for one store transfers directly to another. The key paths to know are the product JSON endpoint, the collection JSON endpoint, and the XML sitemap.
Pagination on /products.json uses page and limit query parameters. The maximum limit is 250. When a page returns fewer than 250 results, you have reached the end of the catalog. Always check the products array length rather than relying on a total count field — the endpoint does not return one.
- Product page: https://brand.com/products/wireless-earbuds
- Product JSON: https://brand.com/products/wireless-earbuds.json
- All products JSON: https://brand.com/products.json?limit=250&page=1
- Collection page: https://brand.com/collections/new-arrivals
- Collection products JSON: https://brand.com/collections/new-arrivals/products.json?limit=250
- Product sitemap: https://brand.com/sitemap_products_1.xml (enumerate _2, _3 etc.)
- Password gate: https://brand.com/password — do not attempt to bypass; stop here
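Because the paths are the same on every store, a few small helpers can build and parse them. This is a minimal sketch using only the standard library; the helper names are hypothetical, and brand.com is the placeholder domain from the list above.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit


def products_json_url(base: str, page: int = 1, limit: int = 250) -> str:
    """Build the all-products JSON URL for a store base URL."""
    query = urlencode({"limit": limit, "page": page})
    return f"{base.rstrip('/')}/products.json?{query}"


def collection_json_url(base: str, collection: str, limit: int = 250) -> str:
    """Build the products JSON URL for a single collection handle."""
    query = urlencode({"limit": limit})
    return f"{base.rstrip('/')}/collections/{collection}/products.json?{query}"


def product_handle(url: str) -> Optional[str]:
    """Return the handle from a /products/<handle> URL, or None."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if "products" not in parts:
        return None
    idx = parts.index("products")
    if idx + 1 >= len(parts):
        return None
    handle = parts[idx + 1]
    # Strip the .json suffix used by the per-product JSON endpoint.
    return handle[:-5] if handle.endswith(".json") else handle
```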
3. products.json: try this first
Before rendering any page in a browser, issue a GET to /products.json?limit=250. On a cooperative store, Shopify returns a JSON object with a products array — each element contains the full variant list, option definitions, image URLs, tags, and timestamps. No CSS selectors, no DOM parsing, no JavaScript execution required.
Pagination is straightforward: increment the page parameter from 1 upward until the returned products array has fewer than limit items. For a 600-product catalog at limit=250, you need three requests: pages 1, 2, and 3.
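The termination rule can be sketched as a small generator. The fetch_page callable is a hypothetical hook you supply (it should return the products array for a given page and limit), which keeps the loop independent of any HTTP client; max_pages is an added safety cap, not something the endpoint requires.

```python
from typing import Any, Callable, Dict, Iterator, List

Product = Dict[str, Any]


def iter_products(
    fetch_page: Callable[[int, int], List[Product]],
    limit: int = 250,
    max_pages: int = 1000,
) -> Iterator[Product]:
    """Yield every product, stopping at the first short (or empty) page."""
    for page in range(1, max_pages + 1):
        products = fetch_page(page, limit)
        yield from products
        # A page with fewer than `limit` items is the last one; there is no
        # total-count field to consult.
        if len(products) < limit:
            break
```

Note that a catalog whose size is an exact multiple of the limit costs one extra request, because the final full page is followed by an empty one.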
Watch for these failure modes: a 404 typically means the store does not expose the endpoint; a 403 or a redirect to /password means the store is gated; an empty products array on page 1 means the endpoint is live but returns no products for this store. In all three cases, fall back to HTML scraping of individual PDPs discovered via the sitemap, and stop entirely if the store is password-protected.
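One way to make these outcomes explicit is to classify each response before deciding what to do next. The sketch below is illustrative only: the EndpointState enum and the function name are hypothetical, and it assumes your HTTP client reports the final URL after redirects.

```python
import json
from enum import Enum
from urllib.parse import urlsplit


class EndpointState(Enum):
    OK = "ok"
    DISABLED = "disabled"          # 404
    GATED = "gated"                # 403 or redirect to /password
    EMPTY = "empty"                # 200 with an empty products array
    RATE_LIMITED = "rate_limited"  # 429
    UNKNOWN = "unknown"


def classify_products_response(status: int, final_url: str, body: str) -> EndpointState:
    """Map a /products.json response onto the failure modes described above."""
    if status == 403 or urlsplit(final_url).path.rstrip("/") == "/password":
        return EndpointState.GATED
    if status == 404:
        return EndpointState.DISABLED
    if status == 429:
        return EndpointState.RATE_LIMITED
    if status != 200:
        return EndpointState.UNKNOWN
    try:
        payload = json.loads(body)
    except ValueError:
        return EndpointState.UNKNOWN
    products = payload.get("products") if isinstance(payload, dict) else None
    if not isinstance(products, list):
        return EndpointState.UNKNOWN
    return EndpointState.OK if products else EndpointState.EMPTY
```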
Rate limiting on /products.json is real. Shopify's platform applies 429 responses on burst traffic, especially on smaller stores on shared infrastructure. Space requests by at least one second per page when paginating, and implement exponential backoff on 429.
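A minimal backoff sketch follows. The fetch callable is a hypothetical hook returning a (status, body) pair, and the delay schedule (1 s, 2 s, 4 s and so on, capped at 60 s) is an assumption chosen for illustration, not a documented Shopify requirement. Adding random jitter to each delay is a common refinement.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(), retrying on 429 with exponentially growing delays."""
    status, body = fetch()
    for attempt in range(retries):
        if status != 429:
            break
        # Delay doubles on each consecutive 429: base, 2*base, 4*base, ...
        sleep(min(cap, base * 2 ** attempt))
        status, body = fetch()
    return status, body
```

If the last attempt still returns 429, the function hands that response back so the caller can decide whether to pause the whole job.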
4. Theme HTML when JSON fails
When /products.json is unavailable, scrape individual product detail pages (PDPs). Shopify's default Dawn theme uses a consistent set of CSS classes that you can rely on across Dawn-based stores. The current price lives in span.price-item--regular, the crossed-out compare-at price in span.price-item--compare, the product title in h1.product__title, and the vendor in an a or span element with class product__vendor.
Variant pickers in Dawn use either a native select element or a set of radio inputs, both carrying data-option-name attributes that identify which option dimension they control (Size, Color, etc.). The selected variant's ID is written to an input[name='id'] hidden field when a variant is chosen.
Custom themes break all of this. A heavily customized store may use completely arbitrary class names, React or Vue components that render no static HTML, or a headless frontend that fetches product data from a separate API. When you encounter a blank or near-empty DOM, check the page source for an embedded JSON blob — many Shopify themes inject the full product object into a script tag as window.ShopifyAnalytics.meta or a JSON-LD block. Parsing that blob is faster and more reliable than scraping rendered HTML.
To find the embedded JSON, look for a script tag containing 'product' and 'variants' in the raw HTML response. A simple regex or a JSON-LD parser on application/ld+json script tags will surface structured product data even when the visible DOM is sparse.
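The JSON-LD route can be sketched with the standard library alone. This example collects every application/ld+json script block and keeps objects whose @type is Product; the class and function names are hypothetical, and it deliberately ignores less common layouts such as @graph wrappers or list-valued @type.

```python
import json
from html.parser import HTMLParser
from typing import Any, Dict, List


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every application/ld+json script block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld = False
        self._buf: List[str] = []
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self.blocks.append("".join(self._buf))
            self._in_ld = False


def extract_json_ld_products(html: str) -> List[Dict[str, Any]]:
    """Return JSON-LD objects of type Product found in the raw HTML."""
    collector = _JsonLdCollector()
    collector.feed(html)
    products: List[Dict[str, Any]] = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed blocks rather than failing the page
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                products.append(item)
    return products
```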
5. Shopify bot protection layers
Shopify's platform applies its own bot mitigation at the infrastructure level, separate from any app a merchant installs. This manifests as 429 rate limits on JSON endpoints, JavaScript challenges on storefront pages under heavy load, and occasional CAPTCHA injection on checkout flows. For catalog scraping (not checkout), the platform-level protection is manageable with residential proxies and reasonable request rates.
Merchant-installed apps add a second layer. Locksmith and similar access-control apps can gate entire collections or individual products behind login or password prompts. These are application-level gates, not network-level blocks — the page loads but renders a form instead of product content. Detect them by checking for a password input or a login redirect in the response.
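A simple detection heuristic, sketched under stated assumptions: it treats a final URL ending in /password or /account/login, or any password input in the returned HTML, as a sign of a gate. The function name is hypothetical, and the check only reports the gate so the scraper can stop; it does nothing to get past it.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _PasswordInputFinder(HTMLParser):
    """Set .found when the page contains an <input type="password">."""

    def __init__(self) -> None:
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.found = True


def looks_gated(final_url: str, html: str) -> bool:
    """Heuristic: True when the response appears to be a password or login gate."""
    path = urlsplit(final_url).path.rstrip("/")
    # /account/login is assumed here as the usual customer login path.
    if path == "/password" or path.endswith("/account/login"):
        return True
    finder = _PasswordInputFinder()
    finder.feed(html)
    return finder.found
```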
Merchants who route their custom domain through Cloudflare introduce a third layer. Cloudflare's bot score system can block or challenge requests that look automated, even on stores where /products.json would otherwise be open. Use mode auto with enable_solver: true and a residential proxy to handle Cloudflare challenges transparently. See the Cloudflare bypass guide for a full breakdown.
- 429 rate limits on /products.json at sustained high request rates
- Password-protected pre-launch or wholesale stores — do not bypass
- Per-merchant Cloudflare bot scoring on custom domains
- JavaScript-only price rendering after variant selection in some themes
- Inventory counts hidden until a specific variant is selected via theme JS
- Locksmith and similar apps gating collections behind login
- IP-based geo-restrictions on certain regional DTC stores
6. Fetch products.json via OmniScrape
Use output_format html to retrieve the raw JSON response body as a string in data.content — then parse it as JSON in your application. The mode auto setting handles both plain HTTP responses and any lightweight JavaScript challenges Shopify may inject. A residential US proxy reduces the likelihood of geo-based blocks and mimics the traffic pattern of a real shopper.
The response body in data.content will be a JSON string. Parse it with JSON.parse(body.data.content) to access the products array. Check that the array is non-empty before paginating: an empty array on page 1 means the store exposes no products through this endpoint, so fall back to HTML scraping.
{
  "url": "https://example-brand.com/products.json?limit=250&page=1",
  "mode": "auto",
  "output_format": "html",
  "proxy": "residential:us"
}
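The same request and the parsing step can be sketched in Python. The helper names are hypothetical, the request body mirrors the one above, and the response shape (a data object with a content string) is taken from the description in this section. The API endpoint and authentication are left out because they are not covered here.

```python
import json
from typing import Any, Dict, List


def build_products_request(store_base: str, page: int = 1, limit: int = 250) -> Dict[str, str]:
    """Build the request body for one page of /products.json."""
    return {
        "url": f"{store_base.rstrip('/')}/products.json?limit={limit}&page={page}",
        "mode": "auto",
        "output_format": "html",
        "proxy": "residential:us",
    }


def parse_products(api_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode data.content (a JSON string) and return the products array."""
    content = api_response["data"]["content"]
    payload = json.loads(content)
    return payload.get("products", [])
```

An empty list from parse_products on page 1 is the signal to switch to PDP scraping.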
7. Scrape a product page with CSS selectors
When /products.json returns a 404, 403, or empty array, fall back to scraping individual PDPs. Use output_format css_extractor with a css_selectors map targeting Dawn theme classes. The OmniScrape API evaluates the selectors server-side and returns extracted values in data.css_extracted — no HTML parsing in your application code required.
The selectors below work reliably on Dawn-based stores. For custom themes, inspect the target store once and update the selector map accordingly. If a selector returns an empty string, the theme likely renders that field via JavaScript after page load — in that case, switch to mode js_rendering with a js_wait_selector to ensure the element is present before extraction.
For stores that embed product data in a JSON-LD script tag, you can also fetch with output_format html and extract the ld+json block from data.content — this avoids selector fragility entirely.
{
  "url": "https://example-brand.com/products/classic-hoodie",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "css_selectors": {
    "title": "h1.product__title",
    "price": "span.price-item--regular",
    "compare_at": "span.price-item--compare",
    "description": "div.product__description",
    "vendor": "span.product__vendor",
    "sku": "span.product__sku",
    "availability": "button[name='add']"
  }
}
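Extracted values usually need light normalization before they are comparable across runs. The sketch below assumes that data.css_extracted maps each selector key to a plain text string, that prices are US-style text such as "$1,299.00 USD", and that the add-to-cart button reads "Sold out" or "Unavailable" when the variant cannot be bought. All three are assumptions to verify per store, and the function names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Dict, Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Parse US-style price text such as '$1,299.00 USD'; None if no number."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


def normalize_extracted(extracted: Dict[str, str]) -> Dict[str, object]:
    """Turn raw css_extracted strings into typed monitoring fields."""
    price = parse_price(extracted.get("price"))
    compare_at = parse_price(extracted.get("compare_at"))
    button = (extracted.get("availability") or "").strip().lower()
    return {
        "title": (extracted.get("title") or "").strip(),
        "price": price,
        "compare_at": compare_at,
        "on_sale": price is not None and compare_at is not None and compare_at > price,
        # Assumed button labels; themes and locales may differ.
        "in_stock": bool(button) and "sold out" not in button and "unavailable" not in button,
        # An empty price suggests the field is rendered client-side.
        "needs_js_rendering": price is None,
    }
```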
8. Variant selection and JavaScript-rendered prices
Some Shopify themes — particularly heavily customized ones and those using React-based headless frontends — do not render the current variant's price in the initial HTML. The price element is present but empty, or shows the default variant price only after a client-side state update triggered by variant selection. If your CSS extractor returns an empty price field, this is the likely cause.
The cleanest solution is to parse the embedded product JSON that most Shopify themes inject into the page source. Look for a script tag containing a JSON object with a variants key — it will include price and compare_at_price for every variant without requiring any JavaScript execution. Fetch the page with output_format html, retrieve the raw HTML from data.content, and extract the JSON blob with a regex or an HTML parser targeting script[type='application/json'] or script[id='ProductJson'].
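The regex variant of this approach might look like the following sketch. It scans application/json script blocks in the raw HTML and returns the first object that has a variants list; the function names are hypothetical. Embedded objects in many themes come from Liquid's json filter, where prices are typically integer cents, so check the units before comparing against /products.json values.

```python
import json
import re
from typing import Any, Dict, Optional

_SCRIPT_JSON = re.compile(
    r"""<script\b[^>]*\btype=["']application/json["'][^>]*>(.*?)</script>""",
    re.IGNORECASE | re.DOTALL,
)


def find_embedded_product(html: str) -> Optional[Dict[str, Any]]:
    """Return the first embedded JSON object that carries a variants list."""
    for match in _SCRIPT_JSON.finditer(html):
        try:
            data = json.loads(match.group(1))
        except ValueError:
            continue  # not valid JSON; try the next script block
        if isinstance(data, dict) and isinstance(data.get("variants"), list):
            return data
    return None


def variant_prices(product: Dict[str, Any]) -> Dict[Any, Dict[str, Any]]:
    """Map variant id to its price and compare_at_price, as embedded."""
    return {
        v.get("id"): {
            "price": v.get("price"),
            "compare_at_price": v.get("compare_at_price"),
        }
        for v in product.get("variants", [])
        if isinstance(v, dict)
    }
```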
When the embedded JSON approach is not viable — for example, on a fully headless store that loads product data via a client-side API call — use mode js_rendering with a js_wait_selector targeting the price element. This tells OmniScrape to hold the headless browser open until the element appears in the DOM before returning the rendered HTML.
See the guide on scraping JavaScript-rendered pages for a full treatment of js_wait_selector patterns and timeout configuration.
{
  "url": "https://example-brand.com/products/classic-hoodie",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "span.price-item--regular",
  "js_wait_timeout": 5000,
  "css_selectors": {
    "title": "h1.product__title",
    "price": "span.price-item--regular",
    "compare_at": "span.price-item--compare",
    "sku": "span.product__sku"
  }
}
9. Competitive monitoring ethics and legal boundaries
Scraping publicly accessible product pages for competitive price intelligence is a common and widely practiced use case. Courts in multiple jurisdictions have found that scraping publicly available data does not inherently constitute unauthorized access, but this is not a blanket permission — store terms of service, regional data protection laws, and the nature of the data all matter.
Password-protected stores are a hard boundary. A /password gate signals that the merchant has restricted access. Attempting to bypass it — whether by replaying session tokens, brute-forcing credentials, or exploiting application logic — constitutes unauthorized access under computer fraud statutes in most jurisdictions. Do not do it.
Checkout flows, customer account pages, and order history are out of scope for competitive monitoring. These contain personal data and are typically restricted by a store's terms of service. Stick to public catalog pages: PDPs, collection pages, and the /products.json endpoint.
Rate-limit your requests to avoid service degradation for the store's actual customers. A reasonable ceiling for catalog monitoring is one request per second per domain. Implement backoff on 429 responses and do not retry aggressively. Being a considerate scraper reduces the likelihood of IP blocks and keeps your monitoring sustainable.
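A per-domain ceiling can be enforced with a small throttle. This is a sketch under the one-request-per-second assumption above; the DomainThrottle class is hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's domain may be hit again; return seconds slept."""
        domain = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last.get(domain)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self._min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last[domain] = now + delay
        return delay
```

Call wait(url) immediately before each request; different stores are throttled independently.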
Frequently asked questions
Should I use products.json or HTML scraping for a Shopify store?
Always try /products.json?limit=250&page=1 first. It returns structured variant-level data with no DOM parsing. Fall back to PDP HTML scraping only when the JSON endpoint returns a 404 or 403, redirects to /password, or returns an empty products array. HTML scraping is slower, more fragile to theme changes, and requires selector maintenance.
How do I paginate through a full Shopify catalog?
Increment the page parameter starting from 1, keeping limit at 250. Stop when the returned products array contains fewer items than the limit value — Shopify does not return a total count field, so array length is your termination signal. For a 600-product store you need three requests: page=1, page=2, page=3 (the third returns 100 items, signaling the end).
How do I discover all product handles without paginating products.json?
Fetch the XML sitemap at /sitemap_products_1.xml. Shopify generates one sitemap file per 5,000 products and lists additional files in /sitemap.xml. Each products sitemap contains the canonical URL for every product, from which you can extract the handle. This approach works even when /products.json is disabled, and it gives you the full URL list for PDP scraping.
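Handle extraction from a products sitemap can be sketched with xml.etree.ElementTree. The example assumes the standard sitemaps.org namespace and keeps only loc entries whose path contains /products/, which skips any non-product URLs and image entries in other namespaces; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def handles_from_sitemap(xml_text: str) -> List[str]:
    """Return product handles from the <loc> entries of a products sitemap."""
    root = ET.fromstring(xml_text)
    handles: List[str] = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        path = urlsplit((loc.text or "").strip()).path
        # Everything after the last /products/ segment is the handle.
        _prefix, sep, handle = path.rpartition("/products/")
        if sep and handle:
            handles.append(handle.strip("/"))
    return handles
```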
Why does Cloudflare appear on a Shopify store?
Shopify's infrastructure does not include Cloudflare by default, but merchants can route their custom domain through Cloudflare independently. When they do, Cloudflare's bot score system evaluates every request. Use mode auto with enable_solver: true and a residential proxy, and OmniScrape's Web Unlocker handles the challenge automatically. See the Cloudflare bypass guide for detailed configuration.
How do I get prices for all variants, not just the default?
The /products.json endpoint includes price and compare_at_price for every variant in the variants array — this is the most reliable method. For PDP HTML scraping, look for an embedded JSON blob in a script tag (often script[id='ProductJson'] or a script containing Shopify.product). It contains the full variant price matrix without requiring JavaScript execution. Only resort to js_rendering with variant click simulation when neither of these approaches is available.
Can I get real-time inventory counts from Shopify?
Only if the store exposes them. The /products.json endpoint includes an available boolean per variant on all stores, and inventory_quantity when the merchant has not hidden it (Shopify allows merchants to hide exact counts). PDP HTML sometimes shows 'Only 3 left' or similar text, but this is theme-dependent. For monitoring purposes, treat inventory as a boolean in_stock signal derived from the available field — exact counts are unreliable across stores.
What is the difference between mode auto and mode js_rendering for Shopify?
Mode auto tries a fast HTTP request first and escalates to a headless browser only if the response indicates a JavaScript challenge or the page is empty. For /products.json and most static PDPs, auto resolves via HTTP without browser overhead. Use mode js_rendering explicitly when you know the price or inventory element is injected by client-side JavaScript after page load — pair it with js_wait_selector targeting the element you need to ensure it is present before the page is returned.
Related guides