1. Where AWS WAF Sits in the Stack
CloudFront is a globally distributed CDN that caches and filters traffic before it ever reaches your origin server. WAF rules evaluate every request at the edge — before cache lookup, before origin forwarding. Rules can count, CAPTCHA, or block based on labels such as awswaf:managed:aws:bot-control:bot:category:scraping or awswaf:managed:token:absent when Bot Control is configured to require a JavaScript-issued challenge token.
ALB-attached WAF behaves the same way for non-CDN architectures: same rule groups, same label taxonomy, same 403/429 responses. In both cases WAF evaluates the IP of the connecting client, not the address of an AWS edge or load-balancer node, so your datacenter egress IP is fully visible to rate-based rules.
A single CloudFront distribution can serve multiple origins and apply path-based WAF rule overrides. The /api/* path may have a strict rate-based rule at 100 requests per 5-minute window per IP, while /static/* has no WAF rules at all. This explains why your browser loads the marketing page fine but your scraper hits 429 on the same hostname.
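To make the path-scoping concrete, here is a sketch of what a rate-based rule limited to /api/ might look like on the operator side, written as a Python dict in the AWS WAFv2 rule JSON shape. The rule name, metric name, and the choice of a 429 custom response are illustrative assumptions; a real site's configuration is not visible from outside.

```python
import json

# Hypothetical rule: names and numbers are illustrative, not taken from any real site.
api_rate_rule = {
    "Name": "api-rate-limit",
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 100,                # requests allowed per window, per key
            "EvaluationWindowSec": 300,  # 5-minute window
            "AggregateKeyType": "IP",    # count per client IP
            # Only requests whose path starts with /api/ are counted.
            "ScopeDownStatement": {
                "ByteMatchStatement": {
                    "SearchString": "/api/",
                    "FieldToMatch": {"UriPath": {}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "STARTS_WITH",
                }
            },
        }
    },
    # Assumed custom response; without it a Block action answers with 403.
    "Action": {"Block": {"CustomResponse": {"ResponseCode": 429}}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "api-rate-limit",
    },
}

print(json.dumps(api_rate_rule, indent=2))
```

Because the scope-down statement only matches /api/, requests to /static/ or the marketing page never count toward the limit, which is consistent with a browser loading those pages while a scraper on the API path is throttled.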
2. Diagnosing CloudFront 403, 429, and Path-Specific Blocks
HTTP 403 from CloudFront arrives with an x-amz-cf-id header, a minimal response body (often a generic 'Request blocked' HTML page or a bare JSON object like {"message":"Forbidden"}), and no application-specific error codes. HTTP 429 Too Many Requests on API pagination is the classic rate-based rule signature on sites that configure a custom block response (AWS WAF's default Block action returns 403): the response often includes Retry-After and the body is empty or a generic message.
Bot Control CAPTCHA challenges appear as HTTP 405 or a redirect to an AWS-hosted CAPTCHA page. These occur after N requests within a configured window on paths where Bot Control is set to CAPTCHA action rather than COUNT or BLOCK. The threshold is site-specific and not publicly documented.
Path-specific behavior is your most reliable diagnostic signal. If the marketing HTML at / returns 200 while /api/v2/products returns 403 on the same domain, the block is almost certainly a WAF path condition rather than a network-level block. Confirm by checking response headers: x-amz-cf-pop indicates a CloudFront edge response; x-cache: Miss from cloudfront or Hit from cloudfront indicates the request reached cache evaluation; absence of any x-amz- headers suggests origin responded directly.
Application-layer 403s from the origin look different: they carry application-specific headers (Set-Cookie with session identifiers, Content-Type: application/json with a structured error body), and they survive when you replay the request with a valid session token. WAF 403s do not — the block happens before the origin sees the request.
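The signals described in this section can be combined into a rough classifier. This is a heuristic sketch, not a definitive test: the function name, the return labels, and the 500-byte cutoff for "structured" JSON bodies are assumptions chosen for illustration.

```python
def classify_block(status: int, headers: dict, body: bytes) -> str:
    """Heuristically label a response as 'edge', 'origin', or 'not-a-block'."""
    if status not in (403, 405, 429):
        return "not-a-block"
    # HTTP header names are case-insensitive, so normalize before checking.
    h = {k.lower(): v for k, v in headers.items()}
    via_cloudfront = "x-amz-cf-id" in h or "x-amz-cf-pop" in h
    # Application-layer signals: session cookies or a substantial JSON error body.
    app_signals = "set-cookie" in h or (
        h.get("content-type", "").startswith("application/json") and len(body) > 500
    )
    if via_cloudfront and not app_signals:
        return "edge"
    return "origin"


print(classify_block(403, {"X-Amz-Cf-Id": "abc"}, b"Request blocked"))
```

Treat the result as a first guess and confirm it by replaying the request with a valid session token, as described above.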
3. Rate-Based Rules, Datacenter Egress, and Backoff Strategy
WAF rate-based rules count requests per IP over a rolling evaluation window, 5 minutes by default (AWS WAF also allows 1-, 2-, and 10-minute windows). A single Hetzner or DigitalOcean egress IP crawling 10,000 API pages at around 3 requests per second will cross a 500 req/5min threshold in under three minutes. The rule fires, the IP stays blocked until its request rate drops back below the limit, and in the meantime every request from that IP that falls within the rule's scope returns 429.
The correct response to a 429 with Retry-After is to stop all requests from that IP for the specified duration, then resume at a lower rate. Doubling concurrency after a 429 is the single most common mistake — it extends the block window and may trigger secondary IP reputation rules that persist longer than the original rate window.
Distributing requests across residential IPs via OmniScrape spreads the per-IP count across many addresses, keeping each well below the threshold. The implementation below shows exponential backoff with jitter at the API client level. Jitter prevents synchronized retry storms across concurrent workers.
import os
import random
import time

import requests

API = "https://api.omniscrape.io/v1/scrape"
KEY = os.environ["OMNISCRAPE_KEY"]


def scrape_with_backoff(url: str, attempt: int = 0) -> dict:
    """POST a scrape job, retrying on 429 with exponential backoff and jitter."""
    resp = requests.post(
        API,
        headers={"X-API-Key": KEY},
        json={
            "url": url,
            "mode": "auto",
            "proxy": "residential:us",
            "enable_solver": True,
            "output_format": "html",
        },
        timeout=120,
    )
    if resp.status_code == 429 and attempt < 5:
        # Retry-After may be missing or an HTTP-date; treat non-numeric values as 0.
        header = resp.headers.get("Retry-After", "")
        retry_after = int(header) if header.isdigit() else 0
        # Wait at least as long as the server asks, otherwise back off exponentially.
        backoff = max(retry_after, 2 ** attempt) + random.uniform(0, 1)
        time.sleep(backoff)
        return scrape_with_backoff(url, attempt + 1)
    resp.raise_for_status()
    body = resp.json()
    if not body.get("success"):
        raise RuntimeError(f"Scrape failed: {body}")
    # HTML content is at body["data"]["content"]
    return body


result = scrape_with_backoff("https://example-saas.com/api/v2/products?page=1")
html_content = result["data"]["content"]
print(f"Method used: {result['metadata']['method_used']}")
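HTTP allows Retry-After to carry either a number of seconds or an HTTP-date, so a client that wants to honor it strictly needs to handle both forms. The helper below is a minimal sketch using only the standard library; the function name and the choice to return 0 for unparseable values are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return a non-negative delay in seconds from a Retry-After header value."""
    if not value:
        return 0.0
    value = value.strip()
    # Form 1: delta-seconds, e.g. "120".
    if value.isdigit():
        return float(value)
    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return 0.0  # unparseable: fall back to the caller's own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

A caller would typically take the larger of this value and its own exponential backoff delay before retrying from the same IP.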
4. AWS WAF Bot Control Tokens and JavaScript Challenges
Bot Control's targeted inspection level requires a JavaScript-generated challenge token embedded in requests. The WAF SDK injects JavaScript into the page that runs a browser fingerprinting and proof-of-work challenge, then sets a cookie or header token. Requests without a valid token are labeled awswaf:managed:token:absent and blocked or CAPTCHAed depending on rule action.
A plain HTTP request — curl, Python requests, or any non-browser client — lacks the JavaScript runtime to acquire this token. The WAF sees the absent label and blocks before the origin responds. This is why some pages that load fine in a browser return 403 immediately from a script.
Setting mode to js_rendering in OmniScrape runs the page in a headless Chromium instance that executes the challenge JavaScript, acquires the token, and attaches it to subsequent requests within the same session. Combining this with enable_solver: true handles CAPTCHA variants that Bot Control may serve. Use js_wait_selector to confirm the challenge has resolved before the response is captured.
Not every Bot Control block requires js_rendering. If the block is purely rate-based (the IP crossed a threshold), the token is irrelevant — the IP is blocked regardless. Diagnose first: if a fresh residential IP gets 403 on the first request, it is likely a token or fingerprint issue. If it gets 200 for the first 100 requests then 429, it is rate-based.
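The diagnostic rule of thumb above can be written down as a small function over the ordered status codes observed from one fresh IP. This is a sketch of the heuristic only; the labels and the set of statuses treated as blocks are assumptions.

```python
BLOCK_STATUSES = {403, 405, 429}


def diagnose(statuses):
    """Guess the block type from status codes seen, in order, on one fresh IP."""
    if not statuses:
        return "no-data"
    # Index of the first blocked response, or None if nothing was blocked.
    first_block = next(
        (i for i, code in enumerate(statuses) if code in BLOCK_STATUSES), None
    )
    if first_block is None:
        return "not-blocked"
    if first_block == 0:
        # Blocked before any request succeeded: the IP had no history to count.
        return "token-or-fingerprint"
    # Succeeded for a while, then blocked: consistent with a rate threshold.
    return "rate-based"


print(diagnose([200] * 100 + [429]))
```

The result is only as good as the assumption that the IP really is fresh; an address that was already throttled by someone else's traffic will look like a token problem.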
5. JSON APIs Are Not WAF-Exempt
A common assumption is that endpoints returning application/json are 'API endpoints' that bypass WAF because WAF is for web pages. This is incorrect. CloudFront applies WAF rules to every HTTP request regardless of Accept or Content-Type headers. A JSON API at /api/v2/products on a CloudFront distribution is subject to the same Bot Control and rate-based rules as the HTML pages on the same distribution.
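The WAF 403 on a JSON API often looks like a valid JSON error (some sites are configured to return {"error":"Forbidden"}), making it easy to confuse with an application-level authorization failure. Start with the x-amz-cf-id header: if it is absent, CloudFront was not in the response path and the origin answered directly. If it is present, the response came through CloudFront, but that alone does not prove the edge generated it, because CloudFront attaches its request ID to responses it forwards from the origin as well. Treat it as an edge block when the header appears alongside a Server: CloudFront header, a short generic body, and no application-specific headers.
Also check x-amz-cf-pop, which identifies the CloudFront point of presence that handled the request. Like x-amz-cf-id, it confirms that CloudFront handled the request; if both headers are missing, the origin responded without CloudFront in the path.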
For API endpoints that require authentication, test the URL without credentials first. If you get 403 without credentials and 200 with credentials, the block is application-layer authorization. If you still get 403 with credentials you know are valid and sufficiently privileged, the block is most likely WAF: the credentials never reached the origin.
6. Distributing Pagination Across Residential IPs
The architectural fix for rate-based rules is request distribution: partition the URL list across multiple workers, each using a different residential IP via OmniScrape's proxy pool. Per-worker concurrency stays low (1–2 req/s), but aggregate throughput scales with worker count without any single IP accumulating enough requests to trigger the rate rule.
OmniScrape rotates IPs within the residential pool automatically when it detects block signatures: 429 responses, empty bodies on expected JSON paths, or CAPTCHA redirects. This rotation is transparent to your worker code. Pair it with honoring Retry-After at the job level: when a worker receives 429, it backs off and lets OmniScrape select a fresh IP for the next attempt.
For large catalogs, a queue-based architecture works well: push all target URLs into a queue (SQS, Redis, or a database table), spin up N workers that each pull URLs and call OmniScrape, and track completion and retry state in the queue. This decouples URL discovery from fetching and makes it straightforward to add workers without changing the scraping logic.
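A minimal in-process version of that queue-and-workers layout can be sketched with the standard library. Here queue.Queue stands in for SQS or Redis, and fetch is a hypothetical callable (for example, a wrapper around the scrape request) that raises on failure; backoff between retries is omitted to keep the sketch short.

```python
import queue
import threading


def run_workers(urls, fetch, n_workers=4, max_retries=3):
    """Drain a URL queue with n_workers threads; return (results, failures)."""
    q = queue.Queue()
    for url in urls:
        q.put((url, 0))  # (url, attempts so far)
    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url, attempt = q.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                data = fetch(url)
                with lock:
                    results[url] = data
            except Exception as exc:
                if attempt + 1 < max_retries:
                    q.put((url, attempt + 1))  # retry state lives in the queue
                else:
                    with lock:
                        failures[url] = repr(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failures


results, failures = run_workers(["a", "b"], lambda u: u.upper(), n_workers=2)
print(results, failures)
```

Adding capacity is a matter of raising n_workers; the fetch logic does not change, which is the decoupling the queue is meant to provide.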
Session continuity matters for authenticated scrapes: if the target site issues session cookies after login, use OmniScrape's session_id parameter to pin a sequence of requests to the same IP and cookie jar. Reusing session cookies across different IPs can trigger session-hijacking or geographic anomaly checks on the application layer, independent of WAF.
7. When the Fast Lane Is Sufficient
Not every path on a WAF-protected domain requires js_rendering and a residential proxy. Static marketing pages, public documentation, and open blog content on the same CloudFront distribution may have no WAF rules or only IP reputation checks. Forcing js_rendering on these paths wastes browser compute and increases latency.
mode: auto is the correct default. OmniScrape attempts a fast HTTP request first. If the response indicates a block (403, 429, CAPTCHA redirect, or a body that does not match the expected structure), it escalates to js_rendering with the configured proxy and solver settings. This means you pay browser-rendering costs only for requests that actually need it.
Route by URL pattern in your worker when you have clear knowledge of which paths are strict. If /api/* always requires solver and residential proxy while /blog/* never does, configure two request profiles and apply them based on the URL prefix. This reduces cost and latency on the open paths without sacrificing reliability on the protected ones.
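Prefix-based routing of this kind can be sketched as a lookup table of request profiles. The prefixes and profile contents below are illustrative, and whether the API accepts a request without a proxy field or with enable_solver set to false is an assumption to verify against the OmniScrape documentation.

```python
from urllib.parse import urlparse

# Hypothetical profiles keyed by path prefix; adjust to what you observe per path.
PROFILES = {
    "/api/": {"mode": "auto", "proxy": "residential:us", "enable_solver": True},
    "/blog/": {"mode": "auto", "enable_solver": False},
}
DEFAULT_PROFILE = {"mode": "auto"}


def profile_for(url):
    """Return the request payload for a URL based on its path prefix."""
    path = urlparse(url).path
    for prefix, profile in PROFILES.items():
        if path.startswith(prefix):
            return {"url": url, **profile}
    return {"url": url, **DEFAULT_PROFILE}


print(profile_for("https://example-saas.com/api/v2/products?page=1"))
```

Keeping the profiles in one table makes it easy to tighten or relax a path's settings when its block rate changes, without touching the worker code.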
8. Logging WAF vs Origin Status for Diagnosis
Store metadata from every OmniScrape response alongside your scraped data. The fields that matter for WAF diagnosis are: the HTTP status code of the final response, metadata.method_used (fast or js_rendering), metadata.solver_used, metadata.challenge_solved, and the response body length. A WAF block typically produces a short body (under 500 bytes) with a 403 or 429 status. An origin 404 or 401 will have a different body structure.
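One way to capture those fields is to flatten them into a single log record per request. In this sketch the status code and body length are passed in by the caller, since where they live in the response is not specified here; the function name and the suspected_waf_block flag are assumptions.

```python
def waf_log_record(url, status, body, raw_length):
    """Flatten the diagnosis-relevant fields of one response into a dict."""
    meta = body.get("metadata", {}) if isinstance(body, dict) else {}
    return {
        "url": url,
        "status": status,
        "method_used": meta.get("method_used"),
        "solver_used": meta.get("solver_used"),
        "challenge_solved": meta.get("challenge_solved"),
        "body_length": raw_length,
        # Short body plus 403/429 matches the typical WAF block shape.
        "suspected_waf_block": status in (403, 429) and raw_length < 500,
    }


record = waf_log_record(
    "https://example-saas.com/api/v2/products?page=1",
    403,
    {"metadata": {"method_used": "fast"}},
    120,
)
print(record)
```

Missing metadata fields come through as None rather than raising, so the record can be written even for responses that failed early.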
Track the ratio of 403/429 responses per domain and per URL path over time. A sudden increase in 403s on a path that was previously returning 200 indicates a WAF rule change — the site operator may have tightened Bot Control thresholds or added a new rate rule. This is a signal to adjust your request rate or proxy configuration, not to increase concurrency.
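A small in-memory tracker is enough to illustrate the idea. The grouping key (host plus first path segment), the doubling factor, and the minimum sample size below are assumptions; a production setup would more likely compute this from stored logs.

```python
from collections import defaultdict
from urllib.parse import urlparse

BLOCK_STATUSES = {403, 429}


class BlockRatioTracker:
    """Counts blocked vs total responses per (host, first path segment)."""

    def __init__(self):
        self.total = defaultdict(int)
        self.blocked = defaultdict(int)

    @staticmethod
    def key_for(url):
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = ("/" + segments[0]) if segments else "/"
        return (parts.netloc, prefix)

    def record(self, url, status):
        key = self.key_for(url)
        self.total[key] += 1
        if status in BLOCK_STATUSES:
            self.blocked[key] += 1

    def ratio(self, key):
        seen = self.total.get(key, 0)
        return self.blocked.get(key, 0) / seen if seen else 0.0

    def spiked(self, key, baseline, factor=2.0, min_samples=50):
        """True once the block ratio exceeds factor x baseline on enough samples."""
        if self.total.get(key, 0) < min_samples:
            return False
        return self.ratio(key) > baseline * factor


tracker = BlockRatioTracker()
tracker.record("https://example-saas.com/api/v2/products?page=1", 403)
print(tracker.ratio(("example-saas.com", "/api")))
```

When spiked returns true for a path, the appropriate reaction is the one described above: lower the request rate or revisit the proxy configuration for that path.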
OmniScrape's usage dashboard logs request outcomes by API key. Correlate your application logs with the dashboard to identify which URL patterns are consuming the most retries and browser-rendering credits. Paths with high retry rates are candidates for configuration tuning: lower concurrency, different proxy region, or js_wait_selector adjustments.
Log billing.charged and billing.balance_after from each response if you are operating near credit limits. WAF-heavy scrapes with frequent solver activations consume credits faster than simple HTML fetches. Knowing the per-URL cost distribution helps budget accurately.
9. Common AWS WAF Scraping Mistakes
Ignoring Retry-After on 429 responses is the most damaging mistake. Continuing to send requests from a blocked IP extends the block window and may escalate to a longer-duration IP reputation block. Always parse the Retry-After header and enforce a hard pause before retrying from the same IP.
Using a single datacenter egress IP for a full catalog scrape. One IP, one rate window, one block. Even a modest rate limit of 500 req/5min means you can fetch at most 6,000 pages per hour from a single IP — far below what most catalog scrapes require. Residential IP distribution is not optional for high-volume scrapes against WAF-protected targets.
Assuming JSON endpoints skip WAF. CloudFront applies WAF rules to all HTTP traffic on the distribution. The Content-Type of the response does not affect WAF evaluation.
Confusing application-layer 403 with edge-layer 403. The remediation is different: an application 403 requires fixing credentials or session state; a WAF 403 requires fixing IP, rate, or token. Checking the response headers for x-amz-cf-id, together with a Server: CloudFront header and a short generic body, is the fastest way to distinguish them.
Forcing js_rendering on every URL regardless of whether it is needed. This increases latency, consumes more credits, and provides no benefit on paths that WAF allows through on plain HTTP. Use mode: auto and let OmniScrape escalate only when necessary.
Not testing with a fresh residential IP before assuming the block is unsolvable. WAF rate blocks are temporary. If you wait out the window and retry with a clean IP at low rate, you will often get 200. If you still get 403 on the first request from a fresh IP, the block is fingerprint or token-based, not rate-based — and that changes the fix.
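The single-IP ceiling mentioned above (500 req/5min giving 6,000 pages per hour) is simple arithmetic, and the same calculation gives a rough count of IPs needed for a target throughput. The headroom parameter, which keeps each IP at a fraction of the limit, is an assumption; the real limit is unknown for sites you do not operate.

```python
import math


def max_pages_per_hour(limit, window_seconds=300):
    """Upper bound on requests per hour from one IP under a windowed limit."""
    return limit * 3600 // window_seconds


def ips_needed(target_pages_per_hour, limit, window_seconds=300, headroom=0.5):
    """IPs required if each one stays at `headroom` of the per-IP ceiling."""
    per_ip = max_pages_per_hour(limit, window_seconds) * headroom
    return math.ceil(target_pages_per_hour / per_ip)


print(max_pages_per_hour(500))   # 6000
print(ips_needed(100_000, 500))  # 34
```

With a 500 req/5min limit and each IP held to half of it, a 100,000 pages-per-hour target needs about 34 distinct IPs.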
Frequently asked questions
How do I know AWS WAF blocked me versus the origin server?
Check for the x-amz-cf-id response header. If it is present on a 403 or 429 together with a Server: CloudFront header and a short, generic body, the block was almost certainly generated at the CloudFront edge: WAF evaluated the request and rejected it before forwarding to origin. If x-amz-cf-id is absent, or if the response contains application-specific error structures (session tokens, structured JSON error codes), the origin responded. You can also compare the response body size: WAF blocks tend to be very short (under 500 bytes), while origin errors usually include more context.
What request rate is safe for WAF-protected APIs?
There is no universal safe rate — it depends on the specific rate-based rule configured by the site operator, which is not publicly disclosed. A practical starting point is 1 request per 2 seconds per IP, monitoring the 429 rate. If you see no 429s after 200 requests, you can gradually increase. Distribute across multiple residential IPs via OmniScrape so that aggregate throughput scales without any single IP approaching the threshold. The goal is to keep per-IP request count well below whatever the window limit is.
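The "start slow, speed up gradually" approach can be sketched as a small per-IP pacer. The starting interval of 2 seconds and the 200 clean requests come from the answer above; the floor, the 20% speed-up step, and doubling on 429 are assumptions.

```python
class AdaptivePacer:
    """Tracks the delay between requests for one IP."""

    def __init__(self, start_interval=2.0, min_interval=0.5, step=0.8, clean_needed=200):
        self.interval = start_interval
        self.min_interval = min_interval
        self.step = step
        self.clean_needed = clean_needed
        self.clean = 0  # consecutive responses without a 429

    def on_response(self, status):
        """Update and return the delay to use before the next request."""
        if status == 429:
            self.interval *= 2  # slow down sharply after a rate block
            self.clean = 0
        else:
            self.clean += 1
            if self.clean >= self.clean_needed:
                # Speed up a little after a long clean run, but never below the floor.
                self.interval = max(self.min_interval, self.interval * self.step)
                self.clean = 0
        return self.interval


pacer = AdaptivePacer()
print(pacer.on_response(200))
```

A worker would sleep for the returned interval (or the Retry-After value, if larger) before sending its next request from that IP.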
Does enable_solver fix AWS WAF Bot Control blocks?
It depends on the block type. Bot Control blocks caused by missing JavaScript tokens (awswaf:managed:token:absent) or CAPTCHA challenges are resolved by enable_solver combined with mode: js_rendering — OmniScrape executes the challenge JavaScript in a headless browser and acquires the token. Pure rate-based blocks (IP exceeded request threshold) are not fixed by the solver; they require waiting out the block window and distributing requests across different IPs. Diagnose first: if a fresh residential IP gets blocked on its first request, it is likely a token issue. If it gets through for a while then hits 429, it is rate-based.
Can I whitelist my scraper IP in AWS WAF?
Only if you control the AWS account that owns the WAF. You would add your IP to an IP set and create a rule that ALLOWs requests from that set before the managed rule groups evaluate. For third-party SaaS products or public sites you do not own, you have no access to their WAF configuration. In that case, residential IP distribution is the correct approach — no single IP accumulates enough requests to trigger rate rules.
Which OmniScrape proxy setting works best for US SaaS dashboards?
proxy: residential:us is the most effective starting point for US-hosted SaaS dashboards. Bot Control and rate rules are often tuned with the expectation of US residential traffic, so datacenter IPs and non-US geolocations are more likely to trigger stricter rule actions. Test with mode: auto first — OmniScrape will use a fast HTTP request and escalate to js_rendering with the residential proxy only if the initial attempt is blocked.
How do I handle AWS WAF CAPTCHA challenges in a scraping pipeline?
Set mode: js_rendering and enable_solver: true in your OmniScrape request. The headless browser executes the Bot Control challenge JavaScript, and the solver handles the CAPTCHA if one is presented. Use js_wait_selector to specify a CSS selector that appears only after the challenge resolves — for example, a product grid or API response container. This ensures OmniScrape waits for the post-challenge content rather than returning the challenge page itself.
What is the difference between a CloudFront 403 and a CloudFront 429 from WAF?
A 403 Forbidden from WAF means the request matched a BLOCK rule action — the request type, IP reputation, token status, or fingerprint triggered an outright rejection. A 429 Too Many Requests means a rate-based rule fired — the IP exceeded the configured request count within the rolling window. The remediation differs: 403 blocks require changing IP, acquiring a token, or solving a challenge; 429 blocks require waiting out the Retry-After window and reducing per-IP request rate. Both can appear on the same domain at different thresholds.