1. How sites decide you are a bot
Modern anti-bot stacks — Cloudflare Bot Management, Akamai Bot Manager, DataDome, PerimeterX — score every request before your parser ever runs. The score is a weighted sum across three signal families: network, application, and behavioral.
Network signals include IP reputation (datacenter ASN vs residential ISP), autonomous system history, geo velocity (the same session arriving from a US IP and then from a Japanese IP two seconds later), and request rate per IP. A single AWS us-east-1 egress IP on a luxury retail site will score near the bot threshold before it sends a single header, because that ASN has been abused by scrapers for years.
Application signals cover TLS fingerprint (JA3 and the newer JA4 scheme), HTTP/2 SETTINGS frames, HPACK header compression, header order, and whether the client negotiates cipher suites consistent with the claimed User-Agent. A scraper that sends Chrome 124 in the User-Agent but negotiates TLS with a Python 3.12 cipher list is flagged immediately — the mismatch is more suspicious than just being honest about being a bot.
Behavioral signals kick in on JavaScript-rendered pages: mouse entropy, scroll depth, time-between-keystrokes, and whether client-side sensors such as DataDome's collector script or the Akamai sensor script behind the _abck cookie actually executed and reported plausible telemetry. A scraper that nails CSS selectors but skips JS execution fails behavioral checks even if every network and header signal is perfect.
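To make the weighted-sum idea concrete, here is a toy model of how the three signal families might combine. The weights, the per-family values, and the threshold are all invented for illustration; vendors do not publish theirs, and real scoring is far more elaborate.

```python
# Toy model of a weighted bot score. Weights, signal values, and the
# threshold are invented for illustration; real vendors do not publish theirs.
WEIGHTS = {"network": 0.4, "application": 0.35, "behavioral": 0.25}
BLOCK_THRESHOLD = 0.7  # hypothetical enforcement threshold


def bot_score(signals: dict[str, float]) -> float:
    """Combine per-family suspicion values (0.0 = human-like, 1.0 = bot-like)."""
    return sum(WEIGHTS[family] * signals.get(family, 0.0) for family in WEIGHTS)


# Datacenter IP, default Python TLS stack, no JavaScript telemetry.
datacenter_client = {"network": 0.9, "application": 0.8, "behavioral": 1.0}
# Residential IP, browser-grade TLS, plausible telemetry.
browser_client = {"network": 0.1, "application": 0.1, "behavioral": 0.2}
```

In this sketch the datacenter client lands above the threshold and the browser-like client well below it, which is the practical point: no single family has to be perfect, but a bad reading in every family adds up quickly.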
The goal is not invisibility — it is staying below enforcement thresholds while collecting data you are legally entitled to collect. That means realistic concurrency, session continuity, and escalating technical means only when pages require it.
2. The four stages of getting blocked
Blocks escalate in predictable stages, and your response to each should be different. Conflating them leads to wasted budget and burned IP pools.
Stage one is soft rate limiting: HTTP 429 with Retry-After headers, or responses that slow to 10–30 seconds per page. The site is warning you, not banning you. Back off with jitter and resume at a lower rate. Stage two is challenge interstitials — Cloudflare 'Checking your browser', Imperva meta-refresh pages, or CAPTCHA widgets injected into an otherwise normal 200 response. The page is real but gated. Stage three is a hard HTTP 403 with no solve path — usually an IP-level block that requires rotating to a fresh proxy. Stage four is the silent tarpit: HTTP 200 with garbage HTML, empty product grids, or prices stuck at $0.00. The site is deliberately feeding you bad data to waste your compute and poison your database.
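The stage-one advice to back off with jitter could look roughly like this. It is a minimal sketch: the function name, the default base and cap, and the choice of "full jitter" are assumptions, not something a particular site requires.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    # Exponential ceiling, capped so a long 429 storm cannot grow unbounded.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick anywhere in [0, ceiling] so workers do not retry in lockstep.
    delay = rng.uniform(0, ceiling)
    if retry_after is not None:
        # Never retry sooner than the server asked; keep a little jitter on top.
        delay = retry_after + rng.uniform(0, base)
    return delay
```

Pass the parsed Retry-After value (in seconds) when the response carries one; otherwise the delay grows with each consecutive 429 on that domain.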
Your monitoring must distinguish each stage or you will optimize the wrong metric. A pipeline that treats tarpit 200s as success will silently corrupt every downstream table. Log response status, body byte length, and a presence-check on at least one key selector — not just 'request completed with 200'.
- 429 Too Many Requests: back off with jitter, never increase concurrency through a 429 storm
- 403 + challenge HTML (Cloudflare challenge markup, DataDome script, _abck references): need browser rendering, a solver, or a residential IP
- 200 + tiny body (<10 KB on a catalog page that normally returns 150 KB): likely interstitial HTML, not product content
- 200 + empty selectors: JavaScript not rendered or tarpit active; see JavaScript rendering guide
- 200 + plausible but wrong data (price = $0, stock = 0 everywhere): tarpit; cross-check a known product against a trusted source
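The checklist above can be turned into a small classifier that runs on every response before anything is counted as a success. This is a sketch under assumptions: the outcome names, the marker substrings, and the 10 KB floor are illustrative and need tuning per target. Plausible-but-wrong data cannot be caught at this level and needs known-value checks instead.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    RATE_LIMITED = "rate_limited"
    CHALLENGE = "challenge"
    HARD_BLOCK = "hard_block"
    OTHER_ERROR = "other_error"
    SUSPECT_INTERSTITIAL = "suspect_interstitial"
    EMPTY_SELECTORS = "empty_selectors"


# Illustrative substrings only; maintain your own list per target.
CHALLENGE_MARKERS = ("checking your browser", "datadome", "_abck")


def classify(status, body, required_markers, min_bytes=10_000):
    """Map one response to a block stage. `required_markers` are substrings
    (for example a class name) that a genuine page must contain."""
    text = body.lower()
    size = len(body.encode("utf-8"))
    has_challenge = any(marker in text for marker in CHALLENGE_MARKERS)
    if status == 429:
        return Outcome.RATE_LIMITED
    if status == 403:
        return Outcome.CHALLENGE if has_challenge else Outcome.HARD_BLOCK
    if status != 200:
        return Outcome.OTHER_ERROR
    if size < min_bytes:
        # Vendor scripts also appear on healthy pages, so markers only
        # count as a challenge when the body is suspiciously small.
        return Outcome.CHALLENGE if has_challenge else Outcome.SUSPECT_INTERSTITIAL
    if not all(marker.lower() in text for marker in required_markers):
        return Outcome.EMPTY_SELECTORS
    return Outcome.OK
```

Logging the returned outcome next to status and byte length gives you one column to aggregate on, instead of inferring the stage later from raw status codes.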
3. Rate limits and session hygiene
Match request rate to human browsing patterns for that site category. A news aggregator hitting 50 article URLs per second from one IP will be blocked faster than a price monitor fetching 200 product detail pages over ten minutes with randomized two-to-four second gaps between requests. The absolute rate matters less than the pattern: humans cluster requests, pause, and navigate non-linearly.
Use a per-domain rate limiter shared across all workers, not per-worker rate limits. Twenty workers each capped at one request per second add up to twenty requests per second on a single domain, which may be ten times the safe rate. A centralized token bucket or Redis-backed leaky bucket prevents this.
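A minimal in-process version of a shared per-domain token bucket is sketched below. The class and method names are hypothetical, and it only coordinates threads inside one process; workers spread across machines would need the same logic backed by a shared store such as Redis.

```python
import threading
import time


class DomainRateLimiter:
    """One token bucket per domain, shared by every worker thread in the process."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._lock = threading.Lock()
        self._buckets = {}  # domain -> (tokens, last_refill_time)

    def try_acquire(self, domain):
        """Return True if a request to `domain` may be sent now."""
        with self._lock:
            now = self.clock()
            tokens, last = self._buckets.get(domain, (self.burst, now))
            # Refill in proportion to elapsed time, never above the burst size.
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens >= 1:
                self._buckets[domain] = (tokens - 1, now)
                return True
            self._buckets[domain] = (tokens, now)
            return False
```

Each worker calls try_acquire before sending and sleeps briefly when it returns False, so the aggregate rate per domain stays at rate_per_sec no matter how many workers are running.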
Reuse sessions and cookies for paginated paths on one domain. Cloudflare cf_clearance, Akamai _abck, and Imperva visid_incap cookies bind trust to a specific browser fingerprint and IP address. Rotating your IP mid-pagination breaks that chain and re-triggers the full challenge flow. For catalog walks, stick to one residential IP until you finish the category or the session expires, then rotate. Treat cookie jars as first-class state, not throwaway artifacts.
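One way to treat the cookie jar as first-class state is to keep it in the same object as the proxy and browser profile it was earned with. The sketch below assumes hypothetical names (CrawlSession, SessionPool) and a simple round-robin proxy choice; the proxy string format depends on your provider.

```python
from dataclasses import dataclass, field
from http.cookiejar import CookieJar


@dataclass
class CrawlSession:
    """Everything the target binds trust to, kept together for one catalog walk."""
    domain: str
    proxy: str  # for example a sticky residential exit; format is provider-specific
    user_agent: str
    cookies: CookieJar = field(default_factory=CookieJar)


class SessionPool:
    def __init__(self, proxies, user_agent):
        self._proxies = list(proxies)
        self._user_agent = user_agent
        self._next = 0
        self._active = {}  # (domain, category) -> CrawlSession

    def get(self, domain, category):
        """Reuse the same session for every page of one category walk."""
        key = (domain, category)
        if key not in self._active:
            proxy = self._proxies[self._next % len(self._proxies)]
            self._next += 1
            self._active[key] = CrawlSession(domain, proxy, self._user_agent)
        return self._active[key]

    def finish(self, domain, category):
        """Drop the session once the walk ends; the next walk gets a new IP and jar."""
        self._active.pop((domain, category), None)
```

Rotation then happens only in finish, which keeps the IP, cookies, and User-Agent changing together rather than independently.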
Add realistic inter-request delays that vary by page type. Detail pages after a search result page should have a short delay (simulating a click). Navigating between categories can have a longer pause. Pure randomness is better than fixed intervals, but bounded randomness that reflects real user timing is best.
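Bounded randomness by page type might be sketched as follows. The transition table and its bounds are hypothetical starting values, apart from the two-to-four second default mentioned earlier; measure real navigation timing on your target category before trusting them.

```python
import random

# Hypothetical bounds in seconds, keyed by (previous page type, next page type).
DELAY_BOUNDS = {
    ("search", "detail"): (0.8, 2.5),     # a click on a search result
    ("detail", "detail"): (2.0, 4.0),     # reading, then the next product
    ("detail", "category"): (4.0, 10.0),  # moving to a different section
}
DEFAULT_BOUNDS = (2.0, 4.0)


def next_delay(prev_type, next_type, rng=random):
    """Pick a pause for this navigation step, within the bounds for its type."""
    low, high = DELAY_BOUNDS.get((prev_type, next_type), DEFAULT_BOUNDS)
    # Triangular sampling clusters delays around the middle of the range
    # instead of spreading them evenly, while never leaving the bounds.
    return rng.triangular(low, high)
```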
4. When you need proxies (and which kind)
Datacenter proxies are cheap and fast but carry degraded reputation on retail, travel, ticketing, and financial sites. Anti-bot vendors maintain ASN-level blocklists; entire /16 subnets from major cloud providers are pre-scored as high-risk. If Nike blocks your AWS egress IP on the first request, no header tweak or User-Agent rotation fixes that — the block is at the network layer before your HTTP stack speaks.
Residential proxies route through consumer ISP IPs with much higher inherent trust scores. They cost more per GB and have higher latency, but they are the correct tool for protected retail, travel fare comparison, and social platforms. ISP proxies (static residential) split the difference: stable IPs with residential ASNs, useful for sites that penalize IP churn mid-session.
Geo mismatch is a common self-inflicted block: a US IP on an EU-only product release page, a German IP on a US pharmacy with state-level geo restrictions, or a Singapore IP on a Japanese domestic streaming catalog. Set proxy country to match the content geography, not where your scraping server runs. For city-level geo targeting (local business listings, regional pricing), use a proxy pool that supports city-level selection.
Rotating too aggressively hurts as much as not rotating. A fresh IP on every request looks like a bot attack to session-aware defenses. Rotate per session or per domain crawl, not per URL.
curl -X POST https://api.omniscrape.io/v1/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $OMNISCRAPE_KEY" \
  -d '{
    "url": "https://eu-retailer.com/product/8821",
    "mode": "auto",
    "proxy": "residential:de",
    "output_format": "html"
  }'
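The same request can be built from Python with only the standard library. This sketch reuses the endpoint, header, and fields from the curl example; the helper name and the idea of passing the country as a parameter are assumptions, and the request is constructed but deliberately not sent.

```python
import json
import os
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"  # endpoint from the curl example


def build_scrape_request(url, country, api_key):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {
        "url": url,
        "mode": "auto",
        "proxy": f"residential:{country}",  # match the storefront's country
        "output_format": "html",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


req = build_scrape_request(
    "https://eu-retailer.com/product/8821", "de", os.environ.get("OMNISCRAPE_KEY", "")
)
# Sending is left out on purpose: urllib.request.urlopen(req, timeout=60)
```

Keeping the country as an explicit argument makes it harder to forget that the proxy geography should follow the target storefront, not the server running the job.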
5. TLS and headers: the signals you cannot fake easily
Python requests, Go net/http, Node.js axios, and curl all have recognizable TLS fingerprints. The JA3 hash is computed from the TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats, all of which the client offers in its ClientHello before any HTTP header is sent. Anti-bot vendors maintain blocklists of JA3 hashes associated with common automation libraries. Changing User-Agent to 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124' while your TLS stack still hashes to Python 3.12 is worse than using an honest default: the mismatch is a strong bot signal.
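To see why the User-Agent has no influence on the fingerprint, it helps to look at how a JA3 hash is assembled: the five fields are written as decimal values, joined with hyphens inside a field and commas between fields, and the resulting string is hashed with MD5. The sketch below follows that published format; the input values in the usage note are made up, not a real client's handshake.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders; JA3 drops them before hashing.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for decimal ClientHello values."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()
```

Calling ja3(769, [47, 53], [0, 10, 11], [23, 24, 25], [0]) produces the string "769,47-53,0-10-11,23-24-25,0" plus its hash. Swapping the order of the two ciphers changes the hash, which is why two libraries with the same cipher support can still be told apart.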
JA4 (the successor scheme) adds transport-layer detail and is harder to spoof because it captures more of the handshake shape. Some vendors also fingerprint HTTP/2 SETTINGS frames: the initial window size, max concurrent streams, and header table size that Chrome sends differ from those that Go's http2 package sends by default.
Header order matters independently of header values. Real Chrome 124 sends headers in a consistent, recognizable sequence; HTTP libraries typically sort or append headers in a different order. Some WAFs score the delta between claimed browser version and actual header ordering. This is why teams eventually stop patching locally and route through infrastructure that presents genuine browser-grade handshakes: the maintenance cost of keeping TLS and header spoofing current across library updates is high.
If you are self-hosting, libraries like curl-impersonate or tls-client (Go) can present Chrome-matching TLS. For anything beyond simple pages, the engineering cost of maintaining that impersonation across browser version updates usually exceeds the cost of using a managed API.
6. Let auto mode pick the cheapest viable path
Not every URL needs a headless browser. OmniScrape mode auto tries fast HTTP first and only escalates to js_rendering when the response indicates the page is blocked or requires JavaScript execution to populate content. That keeps costs proportional on mixed URL lists — blog posts and open government data stay on the fast path; Cloudflare-protected retailers get a browser render when the fast attempt returns a challenge page.
Check metadata.method_used in each response. If you see js_rendering on URLs that should be static, investigate before assuming the cost is justified. The target may have added a new protection layer, or your request configuration may be forcing browser mode unnecessarily. Tracking method_used per domain over time also gives you an early signal when a target upgrades its anti-bot stack.
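Tracking method_used per domain can be as simple as counting. The sketch below assumes you have already pulled (url, method_used) pairs out of your stored responses; "js_rendering" comes from the text above, while "fast" as the label for the fast path is an assumption.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def js_rendering_share(results):
    """`results` is an iterable of (url, method_used) pairs collected from
    metadata.method_used. Returns {domain: fraction served by js_rendering}."""
    counts = defaultdict(Counter)
    for url, method in results:
        counts[urlsplit(url).hostname][method] += 1
    return {
        domain: per_method["js_rendering"] / sum(per_method.values())
        for domain, per_method in counts.items()
    }
```

Comparing this week's share against last week's for each domain surfaces the case described above: a target that used to resolve on the fast path and now mostly needs a browser render.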
Use output_format css_extractor with css_selectors to extract structured fields server-side rather than downloading full HTML and parsing locally. This reduces payload size and simplifies your pipeline — the API returns a clean key-value map instead of raw HTML.
When a page has dynamic content loaded after a user interaction, use js_wait_selector to tell the browser render to wait until a specific element appears in the DOM before capturing the page. This prevents empty-selector failures on pages where prices or stock levels load asynchronously.
{
  "url": "https://mixed-catalog.com/item/44102",
  "mode": "auto",
  "output_format": "css_extractor",
  "enable_solver": true,
  "css_selectors": {
    "title": "h1.product-title",
    "price": "[data-testid='price']",
    "stock": "[data-testid='stock-status']",
    "sku": "meta[name='sku']"
  }
}
7. Metrics worth tracking per domain, weekly
Block rate per domain, not global average. A 98% overall success rate is meaningless if your highest-value competitor domain is at 40% and silently feeding your pipeline empty rows. Aggregate metrics hide the domains that matter most.
CAPTCHA challenge rate per domain. A sudden spike usually means IP pool burnout or a vendor rule update, not a code regression. Treat it as an infrastructure alert, not a software bug.
Response body length distribution per domain. A sudden shift toward small bodies (under 15 KB on a page that normally returns 120 KB) means interstitial HTML is leaking into your pipeline. Plot this as a percentile chart, not an average, so outliers are visible.
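A short sketch of why percentiles beat averages here: when a tenth of responses are interstitials, the mean barely moves but the 5th percentile collapses. The function names and the 25% ratio are assumptions to adjust per domain.

```python
import statistics


def body_size_percentiles(sizes):
    """Return the 5th, 50th, and 95th percentile of response sizes in bytes."""
    cuts = statistics.quantiles(sizes, n=20, method="inclusive")  # 19 cut points
    return {"p5": cuts[0], "p50": cuts[9], "p95": cuts[18]}


def interstitial_leak(sizes, typical_bytes, ratio=0.25):
    """Flag a domain when its smallest responses fall far below the usual size."""
    return body_size_percentiles(sizes)["p5"] < typical_bytes * ratio
```

For example, ninety bodies around 120 KB mixed with ten bodies of 9 KB still average above 100 KB, yet the 5th percentile sits at 9 KB and trips the check.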
Cost per successful extracted row, broken down by method_used. Browser renders cost significantly more than fast-lane requests. If your mix shifts toward js_rendering without a corresponding increase in protection complexity on the target, you may be over-engineering your request configuration. Know your cost per data point, not just your cost per request.
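Cost per successful row is easy to compute once each request is logged with its method and the number of rows it produced. The prices below are placeholders, not actual pricing, and "fast" as a method label is an assumption.

```python
from collections import defaultdict

# Hypothetical per-request prices; substitute the figures from your own plan.
PRICE = {"fast": 0.0002, "js_rendering": 0.002}


def cost_per_row(records):
    """`records` holds (method_used, rows_extracted) per request.
    Failed requests count toward cost but contribute zero rows."""
    spend = defaultdict(float)
    rows = defaultdict(int)
    for method, extracted in records:
        spend[method] += PRICE[method]
        rows[method] += extracted
    return {m: (spend[m] / rows[m] if rows[m] else float("inf")) for m in spend}
```

Because failed requests still add to spend, a rising block rate shows up here as a rising cost per row even when the per-request price has not changed.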
Time-to-first-byte and total request latency per domain. Latency spikes on previously fast domains often precede a full block — the site is throttling before it bans. Use latency as a leading indicator, not a lagging one.
When block rate crosses a threshold you define per domain (a reasonable starting point is 5%), pause that domain, rotate the proxy pool, and test with a single URL before resuming bulk. Hammering through an active block burns IPs and budget simultaneously.
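The pause-and-probe rule can be sketched as a small circuit breaker per domain. The class name, window size, and minimum sample count are assumptions; the 5% default mirrors the starting point suggested above.

```python
from collections import deque


class DomainBreaker:
    """Pause a domain when its recent block rate crosses a threshold."""

    def __init__(self, threshold=0.05, window=200, min_samples=40):
        self.threshold = threshold
        self.min_samples = min_samples
        self.recent = deque(maxlen=window)  # True = blocked, False = success
        self.paused = False

    def record(self, blocked):
        """Record one outcome and return whether the domain is now paused."""
        self.recent.append(blocked)
        if len(self.recent) >= self.min_samples:
            rate = sum(self.recent) / len(self.recent)
            if rate > self.threshold:
                self.paused = True
        return self.paused

    def probe_succeeded(self):
        """Call after rotating proxies and fetching one test URL successfully."""
        self.recent.clear()
        self.paused = False
```

The scheduler checks paused before dispatching bulk work for a domain and only calls probe_succeeded once the single test URL comes back with real content.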
8. Mistakes that appear repeatedly in production pipelines
Caching challenge pages as successful scrapes. If your S3 archive stores 'Checking your browser' HTML with a 200 status timestamp, every downstream re-parse fails silently and indefinitely. Add a body-length check and a selector presence-check before writing to storage. Reject and requeue anything that fails.
Assuming more concurrency always equals more throughput. Twenty workers on one burned IP produce twenty 403s per second and exhaust your proxy rotation faster. Cap concurrency per domain and per IP, and measure actual successful rows per minute — not requests per minute.
Using the same failed approach for months without measuring block rates weekly. Anti-bot vendors push rule updates continuously; a configuration that achieved 95% success in January may fall to 60% by March without any change on your side. Schedule a weekly review of per-domain block rates as a standing task.
Ignoring robots.txt and legal constraints while chasing technical wins. Technical access and legal permission to use data are entirely separate questions. Review both before deploying a scraper against a new target, and document your analysis.
Rotating User-Agent on every request without matching TLS, header order, and behavioral signals. Random UA rotation in isolation increases suspicion rather than reducing it. Commit to one realistic browser profile per session and hold it consistent.
Not validating extracted data against known values. Scrape a product whose price you know from another source. If your pipeline returns $0.00 or an empty string, you are in a tarpit. Build canary checks into your pipeline — a small set of known-good URLs whose expected output you validate on every run.
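A canary check needs very little code. In the sketch below, run_canaries and its arguments are hypothetical names, and extract stands in for whatever fetch-and-parse function your pipeline already has.

```python
def run_canaries(canaries, extract):
    """`canaries` maps URL -> expected fields; `extract` is your own
    fetch-and-parse callable returning a dict for a URL.
    Returns a list of (url, field, expected, actual) mismatches."""
    failures = []
    for url, expected in canaries.items():
        actual = extract(url)
        for name, want in expected.items():
            got = actual.get(name)
            if got != want:
                failures.append((url, name, want, got))
    return failures
```

Run it at the start of each pipeline execution and stop writing to storage when the list is non-empty. Fields that rarely change, such as a SKU or title, make steadier canaries than a price that the retailer updates often.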
9. An escalation ladder that keeps costs proportional
Each step in this ladder costs more in money or engineering time. Document which step each domain requires so you do not run js_rendering on a Shopify store that the fast path handles without issue.
- Step one: direct HTTP with polite rate limits and realistic headers. Correct for the majority of open data sources, news sites, and documentation pages.
- Step two: add a residential proxy geo-matched to the target storefront when you see 403 responses from datacenter IPs.
- Step three: turn on enable_solver when challenge HTML appears in responses. This handles Cloudflare Turnstile, hCaptcha, and similar interstitials without switching to a full browser render.
- Step four: switch to mode js_rendering with js_wait_selector when prices, stock levels, or other critical fields are missing from the fast-path response, because the page requires JavaScript execution to populate them.
- Step five: use session_id to maintain a persistent browser session across paginated requests when the target binds trust to a session cookie that must persist across page loads.
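The ladder can be encoded so that each domain's step is driven by what you observe rather than by habit. This is a sketch: the observation labels are hypothetical names for the conditions described in the steps, and it deliberately never steps down on its own.

```python
from enum import IntEnum


class Step(IntEnum):
    DIRECT_HTTP = 1
    RESIDENTIAL_PROXY = 2
    SOLVER = 3
    JS_RENDERING = 4
    PERSISTENT_SESSION = 5


# Observation -> lowest step that addresses it (labels are illustrative).
REQUIRED = {
    "ok": Step.DIRECT_HTTP,
    "403_from_datacenter_ip": Step.RESIDENTIAL_PROXY,
    "challenge_html": Step.SOLVER,
    "fields_missing": Step.JS_RENDERING,
    "session_cookie_lost": Step.PERSISTENT_SESSION,
}


def escalate(current, observation):
    """Move up only as far as the evidence requires; never step down automatically."""
    return max(current, REQUIRED[observation])
```

Storing the resulting step per domain gives you the documentation the ladder asks for, and stepping down becomes an explicit decision made during the weekly review.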
Resist the temptation to start at step four. The cost difference between fast and js_rendering across millions of requests is substantial. Measure first, escalate only when data shows the lower step is insufficient. Our rotating proxies guide covers when rotation helps versus when it actively hurts session continuity.
Frequently asked questions
Is a 403 always a bot block, or could it be something else?
Not always. A 403 with challenge markers in the body (cf-browser-verification, DataDome script tags, or _abck references) or with a Retry-After header is most likely a bot block. A 403 on an authenticated API endpoint is more likely an authorization failure unrelated to anti-bot systems. Check whether the same URL works in a browser with no login session. Log the raw response body; bot blocks are typically short HTML pages (under 20 KB), while API authorization errors usually return JSON with an error field.
How many requests per second is safe for a given site?
There is no universal number — it depends on the site's traffic volume, anti-bot vendor, and your IP reputation. A reasonable starting point is one request every two to four seconds per domain from a single IP, with randomized gaps. Measure the 429 rate and adjust downward if it exceeds 1–2%. Government open-data portals often tolerate higher rates than retail or ticketing sites. Distributed workers must share a per-domain rate limiter; each worker running independently at 'low' rates can combine to an aggressive aggregate rate.
Should I rotate User-Agent on every request?
No. Random UA rotation without matching TLS fingerprint, header order, and behavioral signals increases suspicion rather than reducing it. Anti-bot systems correlate all signals together; a mismatch between claimed browser and TLS handshake is a stronger bot indicator than a consistent non-browser UA. Pick one realistic desktop Chrome profile per session and keep it consistent across all paginated requests on that domain.
When should I stop self-hosting and use a managed scraping API?
When maintenance time exceeds the value delivered — typically after the second time a vendor update breaks your solver or TLS impersonation over a weekend. Managed APIs convert that unpredictable ops burden into a predictable per-request cost. The break-even point depends on your engineering hourly cost versus data volume, but most teams hit it earlier than they expect once they account for proxy management, solver maintenance, and monitoring infrastructure.
How do I detect that I am in a tarpit rather than getting real data?
Build canary checks into your pipeline: a small set of URLs whose correct output you know (a product with a known price, a listing with a known count). Run these on every pipeline execution and alert if the extracted values deviate from expected. Also monitor response body length distribution — a sudden shift toward uniformly small bodies on a domain that normally returns large pages is a reliable tarpit signal. Check body.data.content byte length in OmniScrape responses and reject anything anomalously small before writing to storage.
Does enable_solver work on all CAPTCHA types?
OmniScrape's solver handles common challenge types including Cloudflare Turnstile, hCaptcha, and standard reCAPTCHA v2. It does not guarantee success on every custom or enterprise challenge implementation. Check metadata.solver_used and metadata.challenge_solved in the response to confirm whether a solve was attempted and succeeded. If challenge_solved is false on repeated attempts, the target may be using a challenge type that requires a different approach or a fresh residential IP.
What is the difference between mode auto and mode js_rendering?
Mode auto tries fast HTTP first and escalates to js_rendering only if the response indicates a block or missing content. It is the correct default for mixed URL lists where you do not know in advance which pages need browser rendering. Mode js_rendering always uses a headless browser, which costs more per request but guarantees JavaScript execution from the first attempt. Use js_rendering explicitly only when you have confirmed through auto mode's metadata.method_used that a domain consistently requires browser rendering — that way you skip the fast-path attempt and save the latency of a failed first try.