Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs

1. How to tell you actually need JavaScript rendering

The single most useful diagnostic is view-source (Ctrl+U in Chrome), not DevTools Elements. DevTools shows the live DOM after JavaScript has run. view-source shows what the server sent over the wire. If prices, listings, article body, or review counts are missing in view-source but visible in the browser, the content is client-side rendered.

Look for these signals in the raw HTML: a near-empty <div id="root"> or <div id="app">, skeleton loader divs with no text content, or a large JavaScript bundle as the only meaningful payload. Also search for embedded data objects — __NEXT_DATA__, __NUXT__, window.__INITIAL_STATE__, or application/ld+json scripts. These are server-injected and parseable without rendering. If your target fields appear in those structures, you can skip the browser entirely.

Check the Network tab in DevTools for XHR or fetch requests that fire during page load. Filter by XHR/Fetch and look for calls to /api/, /graphql, or /v2/ that return JSON payloads containing your target fields. If those endpoints are accessible without authentication, calling them directly is faster, cheaper, and more stable than rendering the full SPA. Skeleton loader divs in a saved HTML snapshot — gray pulse containers with no text — confirm you captured the pre-hydration state and rendering is required.
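
The raw-HTML checks above can be scripted for a quick first pass. The following is a minimal sketch, not a definitive detector: the function name `assess_raw_html`, the marker list, and the sample HTML are illustrative assumptions, and the empty-mount regex only covers the simple `<div id="root"></div>` shape.

```python
import re

# Markers named in the text above; extend the tuple for other frameworks.
EMBEDDED_DATA_MARKERS = (
    "__NEXT_DATA__",
    "__NUXT__",
    "window.__INITIAL_STATE__",
    "application/ld+json",
)

# Matches an empty mount node such as <div id="root"></div> or <div id="app"> </div>.
EMPTY_MOUNT_RE = re.compile(
    r'<div[^>]*\bid=["\'](?:root|app)["\'][^>]*>\s*</div>', re.IGNORECASE
)


def assess_raw_html(html: str, expected_text: str) -> dict:
    """Report the signals that hint at whether raw server HTML needs rendering."""
    return {
        # True when a value you expect (a price, a title) is already in the raw HTML.
        "target_text_present": expected_text in html,
        # Embedded data objects that can be parsed without a browser.
        "embedded_data_markers": [m for m in EMBEDDED_DATA_MARKERS if m in html],
        # True when the SPA mount node has no server-rendered children.
        "empty_mount_node": bool(EMPTY_MOUNT_RE.search(html)),
    }


# Hypothetical sample: an SPA shell with nothing but a bundle reference.
shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
print(assess_raw_html(shell, "19.99"))
```

If `target_text_present` is false and `embedded_data_markers` is empty, the page is a likely candidate for rendering or for a hidden API call.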

2. Inspect XHR calls before opening a browser

Many SPAs call internal REST or GraphQL endpoints: /api/product/123, /graphql with a products query, or /v2/listings?city=berlin. If those endpoints return JSON without requiring browser-set cookies or signed tokens, fetching them directly eliminates rendering cost entirely. Right-click any XHR call in DevTools Network tab and choose 'Copy as cURL' to get the exact request headers.

Test the copied curl command first without any Cookie or Authorization headers. If it returns data, you have a clean API path — document the endpoint, required query parameters, and any pagination scheme. If it returns 403 or an empty payload, try adding only the session cookie from a manual browser visit. If it still fails, the endpoint requires a full browser session with JavaScript-set tokens, and you need OmniScrape's auto mode with enable_solver or js_rendering.
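
The escalation ladder described above (no credentials, then session cookie only, then a full browser session) can be sketched as a small function. This is an illustration under assumptions: `probe_endpoint` and the injected `fetch` callable are hypothetical names, and `fetch` stands in for whatever HTTP client you use (for example a thin wrapper around `requests.get` that returns the status code and body text).

```python
from typing import Callable, Mapping, Tuple

# fetch(headers) -> (status_code, body_text); supplied by the caller (hypothetical).
Fetch = Callable[[Mapping[str, str]], Tuple[int, str]]


def probe_endpoint(fetch: Fetch, base_headers: Mapping[str, str], session_cookie: str) -> str:
    """Return 'open', 'cookie_only', or 'browser_session' for an API endpoint."""

    def has_data(status: int, body: str) -> bool:
        # Treat non-200 responses and empty payloads as "no data".
        return status == 200 and body.strip() not in ("", "{}", "[]", "null")

    # Step 1: no Cookie or Authorization header at all.
    status, body = fetch(dict(base_headers))
    if has_data(status, body):
        return "open"

    # Step 2: add only the session cookie copied from a manual browser visit.
    with_cookie = dict(base_headers)
    with_cookie["Cookie"] = session_cookie
    status, body = fetch(with_cookie)
    if has_data(status, body):
        return "cookie_only"

    # Step 3: the endpoint likely needs JavaScript-set tokens from a real session.
    return "browser_session"
```

Keeping the HTTP call behind `fetch` means the same ladder can be exercised in tests with a fake function, without touching the target site.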

API endpoints are more brittle than public HTML in one specific way: versioned paths change during backend deploys without redirects. Add endpoint health checks to your CI pipeline alongside selector checks. When an endpoint starts returning 404, your scraper fails silently if you only check HTTP status on the outer page.
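
A health check for a hidden endpoint can be as small as the sketch below. The function name `endpoint_problems` and the key names in the usage lines are hypothetical; the point is to check the endpoint's own status and payload shape rather than the outer page.

```python
import json
from typing import List, Sequence


def endpoint_problems(status: int, body: str, required_keys: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the endpoint looks healthy."""
    if status != 200:
        # A moved or versioned-away path typically shows up here as a 404.
        return [f"unexpected status {status}"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return ["top-level JSON is not an object"]
    # Flag fields your pipeline depends on that have disappeared.
    return [f"missing key: {key}" for key in required_keys if key not in payload]


# Hypothetical usage inside a CI test:
print(endpoint_problems(404, "", ["items"]))
print(endpoint_problems(200, '{"items": []}', ["items", "next_page"]))
```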

3. Configuring js_rendering with js_wait_selector

When you need a real browser, POST to https://api.omniscrape.io/v1/scrape with mode set to js_rendering. The critical parameter is js_wait_selector — a CSS selector that must appear in the DOM before OmniScrape returns the response. Without it, the API returns HTML as soon as the initial document loads, which may be before React or Vue has finished mounting components. js_wait_timeout caps the maximum wait in milliseconds; 10000–15000 ms covers most SPAs under normal network conditions.

Choose wait selectors on stable, data-bearing nodes rather than animated wrappers or loading containers. [data-testid='product-price'] is a better wait target than .price-wrapper or .loading-skeleton. If the site uses data-testid attributes consistently, those are your most stable selectors — they are added by developers for testing and rarely change with visual redesigns. Avoid selectors that match elements present in the skeleton state, because the wait condition would resolve before hydration completes.

Combine js_wait_selector with output_format css_extractor to extract structured fields in the same API call. This avoids a second parsing round-trip and reduces the HTML payload your pipeline needs to process.

js_rendering with wait selector and CSS extraction

json

{
  "url": "https://nextjs-shop.example.com/product/sku-4421",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "js_wait_selector": "[data-testid='product-price']",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "title": "h1",
    "price": "[data-testid='product-price']",
    "rating": "[aria-label*='stars']",
    "availability": "[data-testid='stock-status']"
  }
}
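
For readers calling the API from Python, the request body above could be sent roughly as follows. This is a sketch under stated assumptions: the endpoint URL and payload fields come from this guide, but the `Authorization: Bearer` header is a guess at the authentication scheme (check your account documentation), and `build_scrape_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"  # endpoint named in this guide


def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the JSON payload."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: the auth header name and scheme are not stated in this guide.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


payload = {
    "url": "https://nextjs-shop.example.com/product/sku-4421",
    "mode": "js_rendering",
    "output_format": "css_extractor",
    "js_wait_selector": "[data-testid='product-price']",
    "js_wait_timeout": 12000,
    "css_selectors": {"title": "h1", "price": "[data-testid='product-price']"},
}
request = build_scrape_request("YOUR_API_KEY", payload)
# result = send(request)  # extracted fields would then be under result["data"]["css_extracted"]
```

The HTTP timeout passed to `send` should be longer than `js_wait_timeout`, otherwise the client can give up before the render finishes.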

4. Use auto mode to avoid paying for browsers you don't need

mode auto attempts a fast HTTP fetch first. If the returned HTML contains the content you need, the request resolves at HTTP cost. If the content is absent — empty root div, challenge page, or bot detection redirect — auto escalates to js_rendering automatically. For mixed catalogs where most category pages are server-rendered but product detail pages are client-rendered, auto mode finds the split without you hardcoding URL patterns.

Every response includes metadata.method_used, which will be either 'fast' or 'js_rendering'. Log this field per URL. If your logs show 100% js_rendering across a large crawl, either your entire target is a SPA (expected) or fast mode is failing to detect complete HTML correctly — test one URL you know is server-rendered to confirm the baseline. If you see 100% fast on a known SPA, your wait selector may be too loose.
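
One way to turn the logged `metadata.method_used` values into something actionable is to aggregate them by the first path segment. The sketch below assumes you have already collected `(url, method_used)` pairs from your logs; the function name and the example URLs are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple
from urllib.parse import urlparse


def method_share_by_prefix(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each first path segment to the share of requests per method_used value."""
    counts = defaultdict(Counter)
    for url, method in records:
        segments = [s for s in urlparse(url).path.split("/") if s]
        prefix = "/" + segments[0] + "/" if segments else "/"
        counts[prefix][method] += 1
    return {
        prefix: {method: n / sum(c.values()) for method, n in c.items()}
        for prefix, c in counts.items()
    }


# Hypothetical log sample.
records = [
    ("https://shop.example.com/product/1", "js_rendering"),
    ("https://shop.example.com/product/2", "js_rendering"),
    ("https://shop.example.com/blog/a", "fast"),
    ("https://shop.example.com/blog/b", "js_rendering"),
]
print(method_share_by_prefix(records))
```

A prefix that sits at 100% for one method is a candidate for hardcoded routing; a mixed prefix is worth sampling by hand.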

Partition your URL list when you have prior knowledge. Known SPA paths — /product/, /listing/, /search/ — go directly to js_rendering. Known static paths — /about/, /blog/, sitemap pages — use auto or fast. This avoids the auto overhead on URLs where you already know the answer.
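
The partitioning rule above could be expressed as a small router. This sketch reuses the example path prefixes from the paragraph and assumes that "fast" is accepted as a `mode` value in the same way it appears in `metadata.method_used`; `choose_mode` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Prefixes taken from the examples above; replace with your own confirmed patterns.
SPA_PREFIXES = ("/product/", "/listing/", "/search/")
STATIC_PREFIXES = ("/about/", "/blog/")


def choose_mode(url: str) -> str:
    """Pick a request mode from prior knowledge about the URL's path."""
    # Normalize so that "/about" and "/about/" are treated the same way.
    path = urlparse(url).path.rstrip("/") + "/"
    if path.startswith(SPA_PREFIXES):
        return "js_rendering"  # confirmed client-rendered: skip the fast attempt
    if path.startswith(STATIC_PREFIXES):
        return "fast"  # ASSUMPTION: "fast" is a valid mode value
    return "auto"  # unknown pattern: let the API decide


print(choose_mode("https://shop.example.com/product/sku-4421"))
print(choose_mode("https://shop.example.com/category/shoes"))
```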

5. Infinite scroll and click-to-reveal interactions

js_rendering loads the page once and waits for a selector. It does not scroll, click, or interact with the page beyond initial load. If a product listing loads additional items as the user scrolls, a single js_rendering request captures only the first viewport's worth of results — typically 20–48 items depending on the site's page size configuration.

For infinite scroll targets you need a session-level interaction loop: scroll to the bottom of the page, wait for new product card nodes to appear, repeat until a stop condition is met, then extract from the full DOM. This requires Browser-as-a-Service with a Playwright or Puppeteer script. Factor the session minutes and interaction complexity into your cost model before committing to these targets — they are significantly more expensive per page than a single API call.
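
The interaction loop can be sketched independently of any browser library by passing in the scroll and count actions as callables. This is one possible stop condition (no new items for a few consecutive rounds), not the only one; `scroll_until_stable` is a hypothetical name, and the Playwright calls in the comments use a made-up product-card selector.

```python
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    max_rounds: int = 50,
    patience: int = 2,
) -> int:
    """Scroll until the item count stops growing for `patience` rounds in a row."""
    seen = count_items()
    stalled = 0
    for _ in range(max_rounds):  # hard cap so a broken page cannot loop forever
        scroll_once()
        current = count_items()
        if current > seen:
            seen = current
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return seen


# With Playwright's sync API the callables might look like this (selector is hypothetical):
#   scroll_once = lambda: (page.mouse.wheel(0, 4000), page.wait_for_timeout(1000))
#   count_items = lambda: page.locator("[data-testid='product-card']").count()
```

Extraction then runs once against the full DOM after the loop returns. The `max_rounds` cap also bounds session minutes, which matters for the cost model discussed above.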

Click-to-reveal pricing, age verification modals, and 'Load more' buttons that are not scroll-triggered also require interaction. Identify these patterns during target assessment, not after your pipeline has been running for a week returning incomplete data. A quick manual test with DevTools Network tab open shows whether pagination is scroll-based, button-based, or a standard paginated URL pattern (?page=2).

6. Mining __NEXT_DATA__ and embedded JSON without a browser

Next.js Pages Router injects a <script id="__NEXT_DATA__" type="application/json"> tag into the server-rendered HTML. This tag contains the full getServerSideProps or getStaticProps payload, which is the same data React uses to hydrate the page. Because it is injected server-side, it is present in view-source and in a fast HTTP fetch, even when the raw markup returned to curl contains little or no visible content.

Extracting __NEXT_DATA__ with a regex or an HTML parser replaces browser rendering entirely for these targets. The JSON structure mirrors the page component's props, so you navigate props.pageProps.product.price rather than a CSS selector. Validate the JSON path on each significant site deploy — Next.js App Router (the newer architecture) does not use __NEXT_DATA__ in the same way, and sites migrating from Pages Router to App Router will break assumptions silently.

Similar patterns exist in other frameworks: __NUXT__ for Nuxt.js, window.__INITIAL_STATE__ or window.__PRELOADED_STATE__ for Redux-based apps, and application/ld+json structured data blocks for product schema. Always check for these before defaulting to js_rendering — parsing a script tag is orders of magnitude faster and cheaper than running a headless browser.

Extract __NEXT_DATA__ from server-rendered HTML

python

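import json
import re
import requests

response = requests.get(
    "https://nextjs-shop.example.com/product/sku-4421",
    headers={"User-Agent": "Mozilla/5.0 (compatible; scraper/1.0)"},
    timeout=10,
)
# Fail loudly on 4xx/5xx instead of trying to parse an error page.
response.raise_for_status()
html = response.text

# Pages Router embeds the page props as JSON inside this script tag.
# The pattern tolerates other attributes appearing before or after the id.
match = re.search(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    html,
    re.DOTALL,
)
if match:
    data = json.loads(match.group(1))
    # This path is specific to the example site; inspect the JSON for your target.
    product = data["props"]["pageProps"]["product"]
    print(product.get("price"), product.get("availability"))
else:
    print("__NEXT_DATA__ not found; may need js_rendering")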
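
The same no-browser approach applies to the application/ld+json blocks mentioned earlier in this section. The sketch below uses only the standard library's `html.parser`; the class and function names are hypothetical, and the sample markup is made up for illustration.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collect the decoded contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_ld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_ld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld:
            self._in_ld = False
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page


def extract_ld_json(html: str) -> list:
    parser = LdJsonParser()
    parser.feed(html)
    return parser.blocks


# Hypothetical sample: structured data present even though the mount node is empty.
sample = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "offers": {"price": "19.99"}}'
    '</script></head><body><div id="root"></div></body></html>'
)
print(extract_ld_json(sample))
```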

7. Mobile vs desktop render differences

Some sites serve materially different markup depending on User-Agent and viewport. A mobile UA may receive a lighter SPA with fewer components and different CSS selectors. Prices, availability labels, and structured data attributes can differ between the desktop and mobile versions of the same URL. M-dot redirects (m.example.com) add another layer — a desktop request to www.example.com may redirect to m.example.com with different HTML structure.

OmniScrape's browser paths use realistic desktop profiles by default. If your target is a mobile-first site or if you observed different data in mobile DevTools vs desktop, specify the appropriate User-Agent via custom_headers and confirm which version your selectors were built against. Test both variants when prices differ by channel — this is common in travel and hospitality verticals where mobile rates are displayed differently.

When building selectors, always match the UA and viewport to the actual request your pipeline sends. A selector built from a mobile DevTools session will not match desktop-rendered markup, and vice versa. Document the UA assumption in your scraper configuration alongside the selectors.

8. Validate renders in CI before they break in production

Maintain a set of golden URLs, one per major page template, with expected field values or minimum content thresholds. Run these as integration tests in CI on every deploy of your scraping pipeline. Assert that data.css_extracted.price is non-empty, that the HTML response length exceeds a baseline threshold (a rendered product page should not be under 10 KB), and that data.content contains at least one known stable string from the page.
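
The three assertions described above can be bundled into one check per golden URL. This is a sketch that assumes the response has already been decoded into a dict with `data.css_extracted` and `data.content` available for the same URL (possibly from two requests with different output formats); `golden_url_failures` and its parameters are hypothetical names.

```python
from typing import List, Sequence


def golden_url_failures(
    response: dict,
    required_fields: Sequence[str],
    min_html_bytes: int,
    stable_string: str,
) -> List[str]:
    """Return a list of failed checks; an empty list means the golden URL passes."""
    data = response.get("data") or {}
    extracted = data.get("css_extracted") or {}
    content = data.get("content") or ""

    failures = []
    for field in required_fields:
        # Treat missing, None, and whitespace-only values as empty.
        if not str(extracted.get(field) or "").strip():
            failures.append(f"empty field: {field}")
    if len(content.encode("utf-8")) < min_html_bytes:
        failures.append("content below size baseline")
    if stable_string not in content:
        failures.append("stable string missing")
    return failures


# Hypothetical pre-hydration capture: every check fails.
broken = {"data": {"css_extracted": {"price": ""}, "content": "<div id='root'></div>"}}
print(golden_url_failures(broken, ["price"], 10_000, "Acme Widget"))
```

In a CI test, assert that the returned list is empty and print it on failure so the broken check is visible in the log.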

Compare fast vs js_rendering on the same golden URL weekly. Sites migrate from server-side rendering to client-side rendering without announcement, often as part of a frontend framework upgrade. If a URL that previously resolved in fast mode starts returning empty fields, the site has moved to CSR and your mode configuration needs updating. Catching this in a weekly automated check is far better than discovering it when a business report shows null prices for three days.
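
The weekly comparison reduces to asking which fields the rendered request fills that the fast request no longer does. A minimal sketch, assuming you have the extracted field maps from both runs on the same golden URL (the function name and sample values are hypothetical):

```python
from typing import Dict, List, Optional


def fields_lost_in_fast(
    fast_fields: Dict[str, Optional[str]],
    rendered_fields: Dict[str, Optional[str]],
) -> List[str]:
    """List fields that are populated after rendering but empty in the fast fetch."""
    return sorted(
        name
        for name, value in rendered_fields.items()
        if str(value or "").strip() and not str(fast_fields.get(name) or "").strip()
    )


# Hypothetical weekly run: price and rating have moved to client-side rendering.
fast = {"title": "Acme Widget", "price": ""}
rendered = {"title": "Acme Widget", "price": "19.99", "rating": "4.5"}
print(fields_lost_in_fast(fast, rendered))
```

A non-empty result on a URL that used to pass in fast mode is the signal to update the mode configuration for that template.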

Archive HTML snapshots with a timestamp and site version tag when available. When selectors break, diffing the current snapshot against the last known-good snapshot shows exactly which DOM nodes changed. Store snapshots in object storage keyed by URL hash and date — a week of daily snapshots per golden URL is sufficient for most debugging scenarios.
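
A key scheme for the snapshot store might look like the sketch below. The `snapshots/` prefix, the 16-character hash truncation, and the function name are arbitrary choices for illustration, not a required layout.

```python
import hashlib
from datetime import date


def snapshot_key(url: str, day: date, prefix: str = "snapshots") -> str:
    """Build an object-storage key of the form <prefix>/<url-hash>/<YYYY-MM-DD>.html."""
    # A truncated SHA-256 keeps keys short while staying stable per URL.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}/{url_hash}/{day.isoformat()}.html"


print(snapshot_key("https://nextjs-shop.example.com/product/sku-4421", date(2024, 1, 15)))
```

Grouping by URL hash first means all snapshots for one golden URL share a key prefix, which makes listing the last known-good copy and applying a retention rule straightforward.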

9. Cost control on JS-heavy catalogs

The most common cost mistake is running js_rendering across an entire sitemap because a subset of URLs requires it. Analyze your URL list before the first production run. Categorize by URL pattern, test a sample from each category, and route accordingly: server-rendered paths on auto or fast, confirmed SPA paths on js_rendering. If 15% of your catalog needs rendering, you should be paying browser costs on 15% of requests, not 100%.
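
To make the 15% versus 100% point concrete, here is the arithmetic as a sketch. The unit costs (1 per fast request, 5 per rendered request) are invented for illustration and are not actual pricing; substitute the figures from your plan.

```python
def blended_cost(total_urls: int, rendered_urls: int, fast_cost: int, render_cost: int) -> int:
    """Total cost when only `rendered_urls` of `total_urls` go through a browser."""
    return rendered_urls * render_cost + (total_urls - rendered_urls) * fast_cost


# HYPOTHETICAL unit costs: 1 per fast request, 5 per rendered request.
everything_rendered = blended_cost(100_000, 100_000, fast_cost=1, render_cost=5)
routed = blended_cost(100_000, 15_000, fast_cost=1, render_cost=5)
print(everything_rendered, routed)  # 500000 160000
```

Under these assumed prices, routing cuts the bill to about a third of the render-everything figure; the exact ratio depends entirely on your real per-request costs.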

css_extractor output format on js_rendering requests reduces downstream processing cost. You receive structured key-value pairs in data.css_extracted rather than full HTML, which means less bandwidth, less parsing CPU in your workers, and a cleaner data contract. The rendering cost is the same either way — the extraction happens server-side at no additional charge.

Set js_wait_timeout conservatively. A timeout of 30 seconds on a page that normally renders in 3 seconds wastes browser minutes when a site is slow or partially down. Start with 12000–15000 ms, monitor p95 render times in your logs, and adjust. Pages consistently hitting timeout are either broken, geo-blocked, or require challenge solving — investigate those URLs specifically rather than raising the global timeout. See headless browser scraping for detailed wait strategy patterns.
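
Deriving a timeout from observed p95 render times could look like the following sketch. The headroom factor, floor, and cap are illustrative defaults rather than recommendations from the API, and the function names are hypothetical; it assumes you log a render duration in milliseconds per successful request.

```python
import statistics
from typing import Sequence


def p95_ms(render_times_ms: Sequence[float]) -> float:
    """95th percentile of observed render times (needs at least two samples)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(render_times_ms, n=20, method="inclusive")[-1]


def suggest_timeout_ms(
    render_times_ms: Sequence[float],
    headroom: float = 1.5,
    floor_ms: int = 5000,
    cap_ms: int = 15000,
) -> int:
    """Scale p95 by a headroom factor, then clamp between a floor and a cap."""
    return int(min(cap_ms, max(floor_ms, p95_ms(render_times_ms) * headroom)))


# Hypothetical log sample: most pages render in about 3 seconds.
print(suggest_timeout_ms([3000] * 20))
```

The cap keeps a few pathological URLs from dragging the global timeout upward; those URLs are better investigated individually, as noted above.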

10. When JavaScript sites also run Cloudflare or other WAFs

Browser rendering does not bypass a WAF. A Cloudflare JS challenge or managed challenge page rendered in a headless browser still returns challenge HTML, not product HTML. You need enable_solver set to true alongside js_rendering. The solver handles the challenge flow first, establishes a valid session, and then the browser renders the actual page content. Only after the challenge is solved will your js_wait_selector find product nodes.

The response fields metadata.solver_used and metadata.challenge_solved confirm whether the solver was invoked and succeeded. If challenge_solved is false and your selectors return empty, the solver failed. This usually means the site requires a residential proxy IP to pass the challenge. Add proxy: 'residential:us' (or the appropriate country code) to your request alongside enable_solver.
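
The retry decision described above can be sketched as a pure function over the original payload and the decoded response. The field names come from this section; `retry_with_residential` is a hypothetical helper, and the single-retry policy is an assumption.

```python
from typing import Optional


def retry_with_residential(payload: dict, response: dict, country: str = "us") -> Optional[dict]:
    """Return a retry payload with a residential proxy, or None if no retry applies."""
    meta = response.get("metadata") or {}
    solver_failed = bool(meta.get("solver_used")) and not meta.get("challenge_solved")
    if solver_failed and "proxy" not in payload:
        retry = dict(payload)  # copy so the original request stays untouched
        retry["enable_solver"] = True
        retry["proxy"] = f"residential:{country}"
        return retry
    return None  # solver succeeded, was not used, or a proxy was already tried


# Hypothetical failed attempt without a proxy.
first = {"url": "https://shop.example.com/product/1", "mode": "js_rendering", "enable_solver": True}
failed = {"metadata": {"solver_used": True, "challenge_solved": False}}
print(retry_with_residential(first, failed))
```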

Challenge pages inside a SPA shell are particularly deceptive: the outer HTML may look like a valid page structure, but the content area contains Cloudflare's iframe or Turnstile widget instead of product data. Your js_wait_selector will time out waiting for a product node that never appears. Always check data.content for challenge indicators when debugging timeouts on protected targets. See Cloudflare bypass for detailed cf_clearance behavior on client-rendered zones.
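
A rough heuristic for spotting challenge HTML in data.content is sketched below. The marker strings are assumptions based on commonly seen Cloudflare challenge pages and may change or produce false positives (a legitimate form can embed Turnstile too), so treat a match as a prompt to inspect the HTML, not as proof.

```python
# ASSUMPTION: heuristic markers, lowercased; adjust after inspecting real responses.
CHALLENGE_MARKERS = (
    "challenges.cloudflare.com",  # host that serves the Turnstile / challenge widget
    "<title>just a moment",       # title commonly shown on interstitial challenge pages
)


def looks_like_challenge(content: str, scan_chars: int = 20000) -> bool:
    """Return True if the start of the HTML contains a known challenge marker."""
    head = content[:scan_chars].lower()
    return any(marker in head for marker in CHALLENGE_MARKERS)


print(looks_like_challenge("<html><head><title>Just a moment...</title></head></html>"))
print(looks_like_challenge("<html><body><h1>Acme Widget</h1></body></html>"))
```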

Frequently asked questions

What is the difference between js_wait_selector and css_selectors?

They serve different purposes. js_wait_selector is a timing control — the API holds the browser open until that CSS selector appears in the DOM, ensuring hydration is complete before the response is returned. css_selectors is an extraction map — it defines which fields to pull from the rendered DOM and return in data.css_extracted. You often use the same selector in both: js_wait_selector waits for the price element to exist, and css_selectors extracts its text content. If you omit js_wait_selector, the response may return before React has mounted the price component, giving you an empty extraction result.

Why does js_rendering time out even with a generous js_wait_timeout?

The selector never appeared within the timeout window. Common causes: the selector is wrong (built from a different page variant or UA), the page returned a geo-block redirect to a different template, a Cloudflare challenge page loaded instead of the product page (add enable_solver), or the site is genuinely slow on that request. Debug by logging data.content — save the first 2000 characters of the returned HTML to see what the browser actually received. A challenge page, a 'region not available' message, or a login wall will be immediately obvious.

Can I use mode auto for SPAs and let OmniScrape decide?

Yes, and it is the recommended default for mixed catalogs. Auto attempts fast HTTP first and escalates to js_rendering if content is absent or a challenge is detected. The tradeoff is a small latency overhead on URLs that ultimately need rendering, because the fast attempt adds a round trip before escalation. For URLs you have already confirmed require js_rendering, specify it directly to skip the fast attempt. Use metadata.method_used in responses to build an empirical map of which URL patterns need which mode.

Does output_format markdown work with JavaScript-rendered pages?

Yes. output_format controls the format of the returned content, not the rendering method. Set mode to js_rendering and output_format to markdown — OmniScrape renders the page with a headless browser, waits for your js_wait_selector, then converts the fully hydrated DOM to markdown. The same applies to html and text output formats. output_format css_extractor is the most efficient option when you know exactly which fields you need.

How do I scrape a React or Vue SPA without running Puppeteer locally?

POST the URL to OmniScrape with mode js_rendering, js_wait_selector set to a stable product node selector, and output_format css_extractor with your field map in css_selectors. The API runs a managed headless browser, waits for hydration, extracts your fields, and returns structured data in data.css_extracted. No local browser install, no Chrome process management, no proxy rotation to configure. If you need the full HTML, use output_format html and parse data.content.

What is the most reliable way to find the right js_wait_selector?

Open DevTools on the target page, wait for it to fully load, then use the Elements panel to find a node that contains your target data and has a stable, specific selector. Prefer data-testid attributes, aria-label attributes, or semantic element combinations over generic class names. Verify the selector matches exactly one element by running document.querySelectorAll('[data-testid="product-price"]').length in the DevTools console and confirming it returns 1 (document.querySelector returns only the first match, so it cannot reveal duplicates). Then confirm the selector does not exist in the skeleton/loading state by throttling the network to Slow 3G and checking the Elements panel before hydration completes.

My target uses Next.js App Router — does __NEXT_DATA__ still work?

Not reliably. Next.js App Router (introduced in Next.js 13 and marked stable in 13.4) uses React Server Components and a different data serialization format. The __NEXT_DATA__ script tag is a Pages Router convention. App Router pages may embed data in inline script tags with different formats, or may require JavaScript execution to fetch data via client components. Check view-source for any script tags containing JSON-like structures, but be prepared to fall back to js_rendering with css_extractor for App Router targets.
