Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

1. What a web scraping API actually abstracts

A scraping API productizes the entire anti-bot arms race into one POST request. Behind the endpoint sits a fleet of rotating residential and datacenter IPs, TLS fingerprint spoofing calibrated to match real browser profiles, headless browser workers with stealth patches, and integrated challenge solvers for Cloudflare Turnstile, hCaptcha, DataDome, and Akamai Bot Manager. Your application code sends intent — URL, desired output, geo, whether to solve challenges — and the vendor's infrastructure maintains the countermeasures.

Beyond raw HTML retrieval, a well-designed scraping API returns structured extraction results, execution metadata (which rendering path ran, whether a solver fired, elapsed time), and per-request billing data. That contract matters across the organisation: finance teams can forecast spend from billing.charged per URL type; QA teams can assert on metadata.method_used to prove data lineage; on-call engineers can branch error handling on specific HTTP status codes rather than parsing opaque error strings.

The practical outcome is that your engineers spend time on data modelling and pipeline reliability rather than on rebasing cloudscraper forks or negotiating proxy pool quotas. The vendor absorbs the cost of keeping pace with protection vendor updates; you absorb a per-request fee that is predictable and auditable.

2. The endpoint, authentication, and minimal request

Every OmniScrape fetch goes to a single endpoint: POST https://api.omniscrape.io/v1/scrape. Authentication is via the X-API-Key request header. The same key covers all products on your account — Web Unlocker fetches, Browser-as-a-Service sessions, and residential proxy allocation — so you manage one credential per environment rather than one per product.

Never embed API keys in source code or Docker images. Store them in environment variables, a secrets manager (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager), or a CI/CD secrets store. Rotate from the dashboard API Keys page; if you update all workers simultaneously the rotation is zero-downtime. Treat a leaked key as compromised immediately — the dashboard lets you revoke and reissue without changing your account.

The minimal valid request body is three fields: url, mode, and output_format. Every other field is optional and additive. Start minimal, confirm the response shape, then layer in solver, proxy, and extraction options.

Minimal authenticated request

bash

curl -X POST https://api.omniscrape.io/v1/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: osk_live_your_key_here" \
  -d '{
    "url": "https://example.com/product/55102",
    "mode": "auto",
    "output_format": "html"
  }'
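
The same request can be built from application code. The following is a minimal Python sketch using only the standard library; the environment variable name OMNISCRAPE_API_KEY and the helper names are assumptions for illustration, not part of the API.

```python
import json
import os
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"


def build_scrape_request(url, mode="auto", output_format="html", api_key=None):
    """Build (but do not send) the minimal three-field scrape request."""
    # Read the key from the environment instead of hardcoding it.
    # OMNISCRAPE_API_KEY is an assumed variable name.
    key = api_key or os.environ["OMNISCRAPE_API_KEY"]
    payload = {"url": url, "mode": mode, "output_format": output_format}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": key},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    # urlopen raises urllib.error.HTTPError for non-2xx statuses.
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect in unit tests without touching the network.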

3. Modes: auto, fast, js_rendering

mode controls which execution path the API uses. There are exactly three valid values. Choosing the right one for a URL type is the single biggest lever on both cost and reliability.

auto is the default and the recommended starting point for any mixed URL list. The API attempts a lightweight HTTP fetch first. If the target returns a challenge page, a bot-detection interstitial, or content that requires JavaScript execution to populate, the request automatically escalates to a headless browser worker. You pay the higher browser rate only for URLs that actually need it. Use auto when you are onboarding a new target and do not yet know its protection profile, or when a single pipeline handles URLs across multiple domains with different characteristics.

fast is HTTP-only. It is the lowest-cost path and returns in under a second on cooperative targets. It will not escalate to a browser under any circumstances — if the target requires JavaScript rendering or returns a WAF challenge, the request fails rather than escalates. Use fast only after you have confirmed with auto that a specific URL type consistently returns useful HTML on the fast path (check metadata.method_used === 'fast' in responses). Good candidates: sitemaps, RSS feeds, server-rendered product pages on unprotected domains, your own staging environments.

js_rendering always allocates a headless browser worker, regardless of whether the target needs it. Use it when you need js_wait_selector to pause until a specific DOM node appears, when you are scraping a single-page application that hydrates client-side, or when the target's protection is aggressive enough that you want to skip the fast-path attempt entirely. js_rendering is the most expensive path per request; use it deliberately after profiling with auto.

4. output_format options and when to use each

output_format determines what the API returns in the response body. Choose based on where you want parsing to happen and how stable your target's DOM structure is.

html returns the full rendered page HTML in body.data.content. You parse locally with Beautiful Soup, Cheerio, lxml, or goquery. Use this when your selectors change frequently and you want to keep extraction logic in your own codebase, or when you need the full DOM for link discovery in a crawler.

css_extractor performs server-side extraction using the css_selectors map you provide. Results arrive in body.data.css_extracted as a JSON object keyed by your selector names. This avoids shipping raw HTML across the wire and eliminates a local parsing step — useful when extracted fields are small and stable. Requires you to know your selectors in advance.

markdown converts the page to article-style markdown, stripping navigation, ads, and boilerplate. Useful for content monitoring, LLM ingestion pipelines, and change detection where you care about body text rather than full DOM structure.

text returns plain text with all markup stripped. The most compact output; use for keyword monitoring or sentiment pipelines that do not need structure.

Server-side CSS extraction request

json

{
  "url": "https://competitor.com/product/55102",
  "mode": "auto",
  "output_format": "css_extractor",
  "css_selectors": {
    "title": "h1.product-name",
    "price": ".price-current",
    "availability": ".stock-badge",
    "rating": "[data-testid='rating-score']"
  },
  "proxy": "residential:us",
  "enable_solver": true
}

5. enable_solver, proxy, and js_wait_selector

enable_solver: true activates the integrated challenge solver stack. When the headless browser encounters a Cloudflare Turnstile, hCaptcha, or WAF interstitial, the solver attempts to clear it before returning content. Always pair enable_solver with mode: 'auto' or mode: 'js_rendering' — the fast HTTP path cannot interact with challenge pages. On success, metadata.solver_used and metadata.challenge_solved will both be true in the response.

proxy sets the egress IP tier and country. The value format is tier:country — for example, residential:us, residential:gb, residential:de, residential:id. Residential IPs route through real consumer ISP addresses, which is important for geo-restricted storefronts, price localisation, and targets that block datacenter IP ranges. Match the proxy country to the storefront locale you are targeting: a US residential IP on a UK storefront may return GBP prices but redirect to a US product catalogue, or vice versa.

js_wait_selector is a CSS selector string that tells the js_rendering browser to pause and poll the DOM until that node appears before capturing the page. Essential for SPAs where the product data populates after an API call completes. Pair it with js_wait_timeout (milliseconds) to cap the wait and avoid runaway browser workers. Example: js_wait_selector: '[data-testid="price"]', js_wait_timeout: 8000.

session_id lets you reuse the same browser session across multiple requests — useful for workflows that require login state or multi-step navigation. custom_headers allows you to pass Accept-Language, Referer, or other headers that affect target responses.
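
The option rules in this section can be checked before a request leaves your worker. The sketch below is one possible client-side validator in Python; build_payload and its checks are hypothetical helpers based only on the rules stated above, and the API itself may accept or reject combinations differently.

```python
VALID_MODES = {"auto", "fast", "js_rendering"}


def build_payload(url, mode="auto", output_format="html", *, enable_solver=False,
                  proxy=None, js_wait_selector=None, js_wait_timeout=None,
                  session_id=None, custom_headers=None):
    """Assemble a request body, rejecting combinations the text warns against."""
    if mode not in VALID_MODES:
        raise ValueError(f"invalid mode: {mode!r}")
    # The fast HTTP path cannot interact with challenge pages.
    if enable_solver and mode == "fast":
        raise ValueError("enable_solver needs mode 'auto' or 'js_rendering'")
    # proxy values follow the tier:country format, e.g. residential:us.
    if proxy is not None:
        tier, sep, country = proxy.partition(":")
        if not (tier and sep and country):
            raise ValueError("proxy must look like 'tier:country'")
    # A timeout only makes sense when there is a selector to wait for.
    if js_wait_timeout is not None and js_wait_selector is None:
        raise ValueError("js_wait_timeout requires js_wait_selector")

    payload = {"url": url, "mode": mode, "output_format": output_format}
    optional = {
        "enable_solver": enable_solver,
        "proxy": proxy,
        "js_wait_selector": js_wait_selector,
        "js_wait_timeout": js_wait_timeout,
        "session_id": session_id,
        "custom_headers": custom_headers,
    }
    # Only send options that were actually set.
    for key, value in optional.items():
        if value is not None and value is not False:
            payload[key] = value
    return payload
```

Failing locally on an invalid combination is cheaper than spending a request on a 422.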

6. Reading the response: fields, metadata, and billing

A successful response has HTTP status 200 and body.success === true. The primary content is in body.data.content — this is the HTML string (or markdown or text, depending on output_format). For css_extractor requests, body.data.css_extracted holds the extracted fields as a JSON object. body.data.status_code is the HTTP status the target server returned; body.data.final_url is the URL after any redirects.

Do not treat HTTP 200 + success: true as sufficient validation. Always check that body.data.status_code is 200 (not 404, 403, or 503 from the target), and that your extracted fields are non-empty strings. A success: true response with an empty css_extracted object means the selectors did not match — the page loaded, but your extraction logic failed.
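
These checks are easy to encode once and reuse. A minimal Python sketch follows, assuming the response body has already been decoded into a dict; validation_errors and required_fields are illustrative names.

```python
def validation_errors(body, required_fields=()):
    """Return a list of problems; an empty list means the response looks usable."""
    errors = []
    if body.get("success") is not True:
        errors.append("success is not true")
    data = body.get("data") or {}
    # status_code is what the target server returned, not the API's own status.
    if data.get("status_code") != 200:
        errors.append(f"target status_code was {data.get('status_code')}")
    extracted = data.get("css_extracted") or {}
    for field in required_fields:
        value = extracted.get(field)
        # An empty or missing field usually means the selector did not match.
        if not isinstance(value, str) or not value.strip():
            errors.append(f"field {field!r} is missing or empty")
    return errors
```

A worker would typically write the record only when the returned list is empty and send everything else for inspection.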

The metadata object is your debugging surface. metadata.method_used tells you whether the fast or js_rendering path ran — use this over time to build a per-domain routing table and switch confirmed fast-path domains to mode: 'fast' to reduce cost. metadata.elapsed_time is wall-clock seconds for the full request. metadata.solver_used and metadata.challenge_solved confirm whether a WAF challenge was encountered and cleared.

billing.charged is the cost of this individual request in USD. billing.balance_after is your remaining prepaid balance. Log both fields alongside extracted data; aggregate by method_used and domain to identify which URL types are driving cost.
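
As an illustration of that aggregation, the Python sketch below sums billing.charged per domain and rendering path. It assumes each logged response keeps the data.final_url, metadata.method_used, and billing.charged fields described above; the function name is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cost_by_domain_and_method(responses):
    """Sum billing.charged per (domain, method_used) pair."""
    totals = defaultdict(float)
    for body in responses:
        domain = urlparse(body["data"]["final_url"]).netloc
        method = body["metadata"]["method_used"]
        totals[(domain, method)] += body["billing"]["charged"]
    return dict(totals)
```

In practice this grouping would more likely run as a query in your warehouse; the point is which fields to group on.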

Successful scrape response with CSS extraction

json

{
  "success": true,
  "data": {
    "content": "<!DOCTYPE html>...",
    "status_code": 200,
    "final_url": "https://competitor.com/product/55102",
    "css_extracted": {
      "title": "Ultraboost 22 Running Shoe",
      "price": "$180.00",
      "availability": "In Stock",
      "rating": "4.7"
    }
  },
  "metadata": {
    "method_used": "js_rendering",
    "elapsed_time": 4.2,
    "solver_used": true,
    "challenge_solved": true
  },
  "billing": {
    "charged": 0.012,
    "balance_after": 48.88
  }
}

7. HTTP status codes and error handling

Your client should branch on API HTTP status codes explicitly. Treating all non-200 responses as generic failures makes pipelines fragile and obscures actionable signals. The following codes require distinct handling logic:

401: invalid or missing API key; fix credentials before retrying, do not retry in a loop
402: insufficient account balance; alert billing team, pause all workers until topped up
422: malformed request body (invalid mode value, missing required field); fix request construction, do not retry
429: rate limit exceeded; implement exponential backoff with full jitter, respect Retry-After header if present
502: upstream worker busy or timed out; retry up to 3 times with 1–4 second backoff before dead-lettering
200 + success: false (target-level failure): route to a dead-letter queue; inspect data.status_code and metadata before requeuing

Beyond HTTP status, a 200 response with success: false indicates a target-level failure: the API call succeeded but the page did not return usable content. Log the full response body including data.status_code, data.final_url, and metadata. Route these to a dead-letter queue for inspection rather than silently dropping them. Common causes: target returned 404 or 503, JavaScript did not render in time (increase js_wait_timeout), challenge was not solved (add enable_solver and residential proxy).
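
The handling rules in this section can be expressed as a small dispatch function. This is a sketch in Python; the action names it returns are hypothetical labels for your own worker to act on, not values defined by the API.

```python
def next_action(http_status, body=None, attempt=0, max_502_retries=3):
    """Map an API response to a handling decision for the worker."""
    if http_status == 401:
        return "fix_credentials"      # do not retry in a loop
    if http_status == 402:
        return "pause_workers"        # alert billing, wait for a top-up
    if http_status == 422:
        return "fix_request"          # the payload is malformed; do not retry
    if http_status == 429:
        return "backoff_and_retry"    # exponential backoff with jitter
    if http_status == 502:
        # attempt counts the retries already made for this job.
        return "retry" if attempt < max_502_retries else "dead_letter"
    if http_status == 200:
        if body is not None and body.get("success") is True:
            return "process"
        return "dead_letter"          # target-level failure
    # Anything not listed above is treated conservatively.
    return "dead_letter"
```

Returning a label rather than acting inline keeps the branching testable and lets the queue layer decide how to requeue or alert.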

8. Integration patterns that hold up at production volume

The queue-worker model is the most reliable architecture for high-volume scraping pipelines. A producer writes URLs (with scrape parameters as metadata) to a queue — SQS, Redis Streams, RabbitMQ, or Pub/Sub. Stateless worker processes dequeue jobs, POST to the OmniScrape API, validate the response, and write extracted data to a warehouse or object store. Workers scale horizontally; the queue absorbs bursts and provides backpressure. Keep workers stateless so you can terminate and replace them without losing in-flight work.

Wrap your API client in retry logic with idempotent job IDs. If a worker crashes after a successful API call but before writing to the warehouse, the job will be requeued and the API call will run again — design your writer to upsert on job ID rather than insert, so duplicate API calls do not produce duplicate rows. Store the full API response metadata alongside extracted fields: method_used, solver_used, elapsed_time, charged. This gives you a complete audit trail and the data you need to optimise routing over time.
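
To make the upsert idea concrete, here is a small Python sketch that uses SQLite as a stand-in for the warehouse. The table name, column set, and helper names are assumptions chosen for illustration.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the results table keyed by job ID if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scrapes (
               job_id TEXT PRIMARY KEY,
               extracted TEXT,
               method_used TEXT,
               solver_used INTEGER,
               elapsed_time REAL,
               charged REAL
           )"""
    )
    return conn


def upsert_result(conn, job_id, body):
    """Write one result; a replayed job overwrites its row instead of adding one."""
    meta = body["metadata"]
    conn.execute(
        "INSERT OR REPLACE INTO scrapes "
        "(job_id, extracted, method_used, solver_used, elapsed_time, charged) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            job_id,
            json.dumps(body["data"].get("css_extracted", {})),
            meta["method_used"],
            int(meta["solver_used"]),
            meta["elapsed_time"],
            body["billing"]["charged"],
        ),
    )
    conn.commit()
```

Because job_id is the primary key, a job that is requeued after a crash replaces its earlier row rather than duplicating it.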

For selector maintenance, keep css_selectors in a configuration store (database table or config file) rather than hardcoded in worker code. When a target site changes its DOM, you update the selector config and redeploy without touching worker logic. Run a daily canary job that scrapes a known URL and asserts on expected field values; alert on empty or malformed extractions before they propagate to downstream consumers.
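
A canary assertion can be as simple as matching each extracted field against an expected pattern. The Python sketch below assumes the expectations live in the same configuration store as the selectors; the patterns shown in the test are examples only.

```python
import re


def canary_failures(extracted, expectations):
    """Return the names of fields that are missing or do not match their pattern."""
    failures = []
    for field, pattern in expectations.items():
        value = extracted.get(field, "")
        if not isinstance(value, str) or not re.fullmatch(pattern, value.strip()):
            failures.append(field)
    return failures
```

An empty list means the canary passed; any other result should raise an alert before downstream consumers see the data.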

Start your proof-of-concept with your hardest target URL — the most aggressively protected, most JavaScript-heavy page in your list. If the API clears that URL reliably, the rest of the list is tuning and cost optimisation. Starting with an easy page gives you false confidence. See web scraping with Python for a complete code walkthrough including queue integration and error handling.

9. API vs DIY: where the break-even actually sits

DIY infrastructure — your own proxy pool, Playwright cluster, and challenge solver — has a lower marginal cost per request at very high volume if you have dedicated anti-bot engineering headcount, stable targets, and the organisational appetite to treat scraping infrastructure as a first-class product. That is a narrow set of conditions. Most teams underestimate the maintenance cost: proxy pool churn, browser version updates, fingerprint drift, and protection vendor updates that invalidate solver logic on irregular schedules.

The API wins on total cost of ownership when protection vendors update monthly (Cloudflare ships multiple challenge variants per quarter), when downtime cost — stale data, missed price changes, broken SLAs to downstream consumers — exceeds per-request fees, and when engineering time is better spent on data modelling than on anti-bot countermeasures.

Hybrid is the most common production architecture: API for protected fetches on high-value targets, direct HTTP for open data sources and your own assets. Use metadata.method_used to profile which domains actually need browser rendering and which serve clean HTML on the fast path. Optimise from that data — move confirmed fast-path domains to mode: 'fast' and route confirmed browser-required domains to mode: 'js_rendering' to eliminate the auto escalation overhead. Make routing decisions from observed data, not assumptions about what a site probably does.
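
One way to turn observed metadata.method_used values into routing decisions is sketched below in Python. The min_samples threshold and the rule that a domain moves off auto only when every observation agrees are assumptions; tune both to your own tolerance for failed fast-path requests.

```python
from collections import Counter, defaultdict


def build_routing_table(observations, min_samples=20):
    """Pick a mode per domain from (domain, method_used) pairs seen under mode 'auto'."""
    counts = defaultdict(Counter)
    for domain, method in observations:
        counts[domain][method] += 1

    table = {}
    for domain, methods in counts.items():
        total = sum(methods.values())
        if total < min_samples:
            table[domain] = "auto"            # not enough evidence yet
        elif methods["fast"] == total:
            table[domain] = "fast"            # always served on the HTTP path
        elif methods["js_rendering"] == total:
            table[domain] = "js_rendering"    # always escalated to a browser
        else:
            table[domain] = "auto"            # mixed behaviour, keep escalation
    return table
```

Rebuilding the table on a schedule lets a domain drop back to auto if its behaviour changes.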

10. Governance the API does not solve

Technical access is not the same as legal permission to collect and use data. Before running production schedules against any target, review the target's terms of service, robots.txt directives, and applicable privacy law — GDPR if you may collect EU resident data, CCPA for California residents, and sector-specific regulations for financial, health, or government data. OmniScrape provides the infrastructure; compliance with data protection law and target site terms is your responsibility.

Document your data governance posture before pipelines run unattended: which sources are approved for collection, the legal basis for each, retention periods, PII handling procedures (anonymisation, access controls, deletion workflows), and the contact responsible for each data source. This documentation is what your legal team needs to respond to a data subject access request or a cease-and-desist, and what your security team needs for a vendor risk assessment.

Respect robots.txt as a baseline even where it is not legally binding — it signals the site operator's intent and is relevant context in any legal dispute. Rate-limit your requests to avoid causing service degradation on targets; aggressive scraping that affects site performance creates liability independent of data content.
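
A robots.txt check can run in the producer before a URL is ever queued. The sketch below uses Python's standard urllib.robotparser; it assumes you have already fetched the robots.txt text yourself, and is_allowed is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Filtering at enqueue time keeps disallowed URLs out of the pipeline entirely rather than relying on each worker to check.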

Frequently asked questions

What is the difference between mode: 'auto' and mode: 'js_rendering'?

auto attempts a fast HTTP fetch first and escalates to a headless browser only if the target requires it — you pay the browser rate only for URLs that need it. js_rendering always allocates a browser worker from the start. Use auto for mixed URL lists and cost efficiency; use js_rendering when you need js_wait_selector, when you know a target always requires a browser, or when you want to skip the fast-path attempt entirely on aggressively protected domains.

Where is the HTML content in the response body?

HTML content is in body.data.content — not body.data.html. For css_extractor requests, extracted fields are in body.data.css_extracted as a JSON object keyed by your selector names. Always reference body.data.content for raw HTML regardless of which mode ran.

How do I debug a success: true response with empty extracted fields?

Log the full response body: data.status_code, data.final_url, data.css_extracted, and metadata.method_used. Empty css_extracted with status_code 200 almost always means the selectors did not match the rendered DOM: either the target changed its markup, or the page requires JavaScript to populate the fields and you used mode: 'fast'. Try the same URL with mode: 'js_rendering' and js_wait_selector set to the container element you expect. If data.content contains challenge-page HTML rather than product content, add enable_solver: true and proxy: 'residential:us'.

What is the most cost-effective output_format?

For structured data extraction, css_extractor on mode: 'auto' is typically most efficient: server-side extraction means you receive only the fields you need rather than full HTML, and auto avoids paying browser rates for pages that serve clean HTML. Requesting output_format: 'html' with mode: 'js_rendering' for every URL is the most expensive combination — reserve it for targets that genuinely require it.

Can I reuse a browser session across multiple requests?

Yes. Pass a session_id string in your request body. Subsequent requests with the same session_id reuse the same browser worker and session state, including cookies and local storage. This is required for workflows that involve login, multi-step forms, or cart interactions. Sessions have a maximum idle timeout; check the API documentation for the current limit and refresh the session before it expires if your workflow spans a long duration.

How should I handle 429 rate limit responses?

Implement exponential backoff with full jitter; do not retry immediately or at fixed intervals, as that creates thundering-herd behaviour. A practical formula: wait = random(0, min(cap, base * 2^attempt)), where base is 1 second and cap is 30 seconds. If a Retry-After response header is present, use its value as the minimum wait. Reduce worker concurrency if you hit 429 consistently; the API rate limit is per account, so all workers share the quota.
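
Full jitter is usually taken to mean drawing the wait uniformly between zero and the capped exponential value, and the Python sketch below follows that reading. It assumes Retry-After arrives as a number of seconds (the header can also carry an HTTP date, which is not handled here); backoff_delay is an illustrative name.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0, retry_after=None, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so workers do not retry in lockstep.
    delay = rng.uniform(0.0, ceiling)
    # Treat Retry-After, when present, as the minimum wait.
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay
```

A worker would sleep for backoff_delay(attempt, retry_after=...) between attempts and give up after a fixed number of tries.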

Does the OmniScrape API replace my crawler?

No — the API fetches one URL per request. URL discovery (following links, parsing sitemaps, recursing through pagination) is still your crawler's responsibility. The API handles the retrieval of each individual URL reliably, including protected pages that a naive HTTP client cannot access. A typical architecture combines a lightweight crawler for URL discovery with OmniScrape API calls for the actual page retrieval. See web scraping vs web crawling for a detailed comparison of the two concerns.
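
The discovery half of that split can stay very small. The Python sketch below pulls same-page links out of a response fetched with output_format: 'html', using data.content and data.final_url as described earlier; LinkCollector and discover_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the post-redirect URL and drop fragments.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def discover_links(body):
    """Return links found in data.content, resolved against data.final_url."""
    data = body["data"]
    collector = LinkCollector(data["final_url"])
    collector.feed(data["content"])
    return collector.links
```

Each discovered URL then becomes a new job on the queue, and retrieval goes back through the API one URL at a time.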

Ready to scrape without blocks?

Get your API key in minutes. Test protected URLs from the dashboard — no credit card required to start.
