1. When to use Scrapy
Scrapy earns its complexity budget when you have thousands of URLs, need deduplication across runs, want structured item exports to JSON/CSV/Parquet, and have a team already familiar with spiders and pipelines. For fewer than a few hundred URLs or one-off scripts, plain Python with the OmniScrape API is faster to write and easier to debug.
The sweet spot: site-wide crawls where URL discovery, politeness, and export are the hard parts — and OmniScrape handles the protected fetches transparently inside the downloader.
- Site-wide discovery with CrawlSpider link extraction rules
- Item pipelines for validation, deduplication, and cleaning
- Per-domain concurrency caps and download delay
- Feed exports to S3, GCS, or local Parquet via custom pipeline
- Shared Redis queue for multi-process horizontal scale
- Built-in stats collection for monitoring throughput and error rates
2. Where default Scrapy breaks
Scrapy's Twisted downloader sends requests with a stock Python TLS fingerprint and no browser-like header profile. Retail, travel, and SERP sites fingerprint TLS client hellos and block non-browser stacks within seconds. Scrapy has no built-in challenge solver for JavaScript-based bot checks.
JavaScript-heavy listing pages — where products are injected by React or Vue after the initial HTML loads — return an empty shell to Scrapy's downloader. The spider's CSS selectors find nothing, and items come back empty with no obvious error. These failures are silent and expensive to debug at scale.
- Cloudflare JS challenges stop the stock downloader: nothing in the default middleware stack can execute the challenge script
- Empty item lists when SPA listings are client-side rendered
- No native proxy rotation — stock downloader leaks datacenter IPs
- Spider logic tightly coupled to Scrapy runtime, making fetch logic hard to unit test
- No retry intelligence for soft 403s that return 200 with a CAPTCHA page body
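The soft-403 and empty-shell failures listed above can often be caught with a cheap heuristic before any item is parsed. The sketch below is illustrative only: the marker strings, the size threshold, and the name looks_blocked are assumptions, not part of Scrapy or OmniScrape, and would need tuning per target.

```python
# Hypothetical heuristics; tune the markers and the threshold per target site.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")
MIN_BODY_BYTES = 2048  # assumption: real listing pages are larger than this


def looks_blocked(status: int, body: str) -> bool:
    """Return True when a response is probably a block page or an empty SPA shell."""
    if status in (403, 429):
        return True  # explicit block or rate limit
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True  # "soft 403": HTTP 200 carrying a challenge page
    # A very small body usually means the content is rendered client-side.
    return len(body.encode("utf-8")) < MIN_BODY_BYTES
```

Calling a check like this at the top of a parse callback turns a silent empty item into an explicit, loggable failure.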
3. Pattern A architecture
Pattern A keeps Scrapy's architecture intact. A custom downloader middleware sits above the default HTTP handler in the DOWNLOADER_MIDDLEWARES priority stack. Unless a request opts out by setting the omniscrape meta flag to False, the middleware intercepts it, POSTs the URL to the OmniScrape API, and returns an HtmlResponse constructed from the API response. Scrapy's scheduler, deduplicator, and pipelines never know the fetch happened externally.
Spider parse() methods receive the same HtmlResponse they always have. If you use css_extractor output format, the middleware stashes the extracted dict in response.meta['css_extracted'] and returns a minimal HTML body — the spider yields the dict directly without any CSS selector logic. This keeps spider code clean and testable without a live API.
Set meta flags per-request to control mode: omit the flag for fast HTTP-only pages, set omniscrape_mode to 'js_rendering' for known JavaScript listings, and enable_solver for pages with active bot challenges. A URL regex map in the spider's start_requests is a clean way to assign modes without hardcoding them in the middleware.
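The URL regex map mentioned above could look like the following minimal sketch. The URL patterns and the helper name meta_for are hypothetical; the meta keys are the ones this guide's middleware reads.

```python
import re

# Hypothetical URL patterns; the first matching rule wins.
MODE_RULES = [
    (re.compile(r"/sitemap.*\.xml$"), {"omniscrape": False}),
    (re.compile(r"/category/"), {"omniscrape_mode": "js_rendering"}),
    (re.compile(r"/product/"), {"enable_solver": True}),
]


def meta_for(url: str) -> dict:
    """Return request meta for a URL; an empty dict means the middleware default."""
    for pattern, meta in MODE_RULES:
        if pattern.search(url):
            return dict(meta)  # copy so callers can mutate it safely
    return {}
```

In start_requests this would be used as scrapy.Request(url, meta=meta_for(url)), which keeps mode decisions in one table instead of scattered through the spider.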
4. Pattern A middleware code
Register OmniScrapeMiddleware in DOWNLOADER_MIDDLEWARES at priority 543. Scrapy calls process_request hooks in ascending priority order, so at 543 the middleware runs before RetryMiddleware (550), HttpCompressionMiddleware (590), and RedirectMiddleware (600). Because it returns a response, Scrapy skips the remaining process_request hooks and its own download handler never sends the request.
The middleware reads mode, enable_solver, and css_selectors from request.meta, so individual requests can opt into js_rendering or solver without changing the middleware itself. Requests with meta omniscrape set to False fall through to the stock downloader — useful for sitemaps or friendly internal APIs that don't need the API.
import os

import requests
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class OmniScrapeMiddleware:
    API = "https://api.omniscrape.io/v1/scrape"
    KEY = os.environ["OMNISCRAPE_KEY"]  # raises KeyError at import time if unset

    def process_request(self, request, spider):
        # Opt out per-request: meta omniscrape=False uses the stock downloader
        if request.meta.get("omniscrape") is False:
            return None

        body = {
            "url": request.url,
            "mode": request.meta.get("omniscrape_mode", "auto"),
            "output_format": "html",
        }
        # Upgrade to css_extractor when selectors are provided
        if selectors := request.meta.get("css_selectors"):
            body["output_format"] = "css_extractor"
            body["css_selectors"] = selectors
        # Enable solver for bot-protected pages
        if request.meta.get("enable_solver"):
            body["enable_solver"] = True
        # Optional residential proxy
        if proxy := request.meta.get("omniscrape_proxy"):
            body["proxy"] = proxy

        try:
            # NOTE: requests.post is synchronous and blocks the Twisted reactor
            r = requests.post(
                self.API,
                headers={
                    "X-API-Key": self.KEY,
                    "Content-Type": "application/json",
                },
                json=body,
                timeout=120,
            )
            r.raise_for_status()
            data = r.json()  # a non-JSON body raises ValueError
        except (requests.RequestException, ValueError) as exc:
            spider.logger.error("OmniScrape request error %s: %s", request.url, exc)
            raise IgnoreRequest()

        if not data.get("success"):
            spider.logger.error(
                "OmniScrape failure %s, response: %s", request.url, data
            )
            raise IgnoreRequest()

        # Attach billing and method metadata for pipeline cost accounting
        request.meta["omniscrape_method"] = (
            data.get("metadata", {}).get("method_used")
        )
        request.meta["omniscrape_charged"] = (
            data.get("billing", {}).get("charged")
        )

        payload = data.get("data") or {}  # tolerate a missing "data" key
        if css := payload.get("css_extracted"):
            # css_extractor mode: stash the dict, return an empty shell
            request.meta["css_extracted"] = css
            content = "<html></html>"
        else:
            # html mode: full page content
            content = payload.get("content") or ""

        return HtmlResponse(
            url=request.url,
            body=content.encode("utf-8"),
            encoding="utf-8",
            request=request,
        )
5. Spider using the middleware
The spider sets css_selectors in meta for product detail pages. parse_product checks for css_extracted first — if present, it yields the dict directly. The fallback CSS path handles any URL that slipped through without the extractor (for example, pages where the middleware fell back to html mode due to a selector mismatch).
custom_settings on the spider class keeps middleware registration close to the spider that needs it, rather than in settings.py — useful when only one spider in the project uses OmniScrape.
import scrapy

PRODUCT_SELECTORS = {
    "title": "h1",
    "price": "[data-price]",
    "sku": "[data-sku]",
    "availability": ".stock-status",
    "image_url": "img.product-hero::attr(src)",
}


class ProductSpider(scrapy.Spider):
    name = "products"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middleware.OmniScrapeMiddleware": 543,
            # Disable Scrapy's decompression middleware; None switches off this
            # middleware only, not the downloader itself
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": None,
        },
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 0.5,
        "DOWNLOAD_TIMEOUT": 130,  # slightly above OmniScrape timeout
        "RETRY_ENABLED": False,  # failed API calls are dropped via IgnoreRequest, not retried
    }

    def __init__(self, product_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Values passed with `-a product_urls=...` arrive as one string,
        # so accept a comma-separated list as well as a Python list
        if isinstance(product_urls, str):
            product_urls = [u.strip() for u in product_urls.split(",") if u.strip()]
        self.product_urls = product_urls or []

    def start_requests(self):
        for url in self.product_urls:
            yield scrapy.Request(
                url,
                meta={
                    "css_selectors": PRODUCT_SELECTORS,
                    "enable_solver": True,
                    "omniscrape_proxy": "residential:us",
                },
                callback=self.parse_product,
                errback=self.handle_error,
            )

    def parse_product(self, response):
        if extracted := response.meta.get("css_extracted"):
            yield {
                **extracted,
                "_url": response.url,
                "_method": response.meta.get("omniscrape_method"),
                "_charged": response.meta.get("omniscrape_charged"),
            }
        else:
            # Fallback: css_extractor was not used or returned nothing
            yield {
                "title": response.css("h1::text").get("").strip(),
                "price": response.css("[data-price]::text").get("").strip(),
                "sku": response.css("[data-sku]::text").get("").strip(),
                "_url": response.url,
                "_method": response.meta.get("omniscrape_method"),
                "_charged": response.meta.get("omniscrape_charged"),
            }

    def handle_error(self, failure):
        self.logger.error("Failed: %s: %s", failure.request.url, failure.value)
6. Pattern B: interactive flows outside CrawlSpider
Scrapy is a poor fit for flows that require real browser interaction: login forms with CSRF tokens, infinite scroll discovery, or multi-step checkout funnels. Forcing CrawlSpider to handle these produces brittle code that fights the framework.
Pattern B keeps Scrapy for what it does well — queuing, deduplication, export — and delegates interactive discovery to a separate process. That process uses Playwright (or a BaaS endpoint) to drive a real browser, extracts discovered product URLs, and pushes them into the shared Redis queue that Scrapy workers drain. The two processes are decoupled: the browser process can restart independently, and Scrapy's deduplication filter prevents double-processing URLs that appear in both discovery and sitemap paths.
Use Pattern B only when you genuinely need click or scroll interactions for discovery. Most retail crawls can avoid it entirely by combining sitemap parsing (no OmniScrape needed) with OmniScrape-fetched PDPs.
# discovery/infinite_scroll.py: runs independently of Scrapy
# Uses Playwright to scroll a category page, collects product URLs,
# and pushes them to the shared Redis queue for Scrapy workers.
import asyncio

import redis
from playwright.async_api import async_playwright

REDIS_KEY = "scrapy:products:start_urls"
CATEGORY_URL = "https://example.com/category/shoes"


async def discover_urls():
    r = redis.Redis(host="localhost", decode_responses=True)
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(CATEGORY_URL)
        seen = set()
        for _ in range(20):  # scroll up to 20 times
            links = await page.eval_on_selector_all(
                "a.product-card", "els => els.map(e => e.href)"
            )
            new_links = [link for link in links if link not in seen]
            if not new_links:
                break  # nothing new appeared after the last scroll
            for link in new_links:
                r.rpush(REDIS_KEY, link)
                seen.add(link)
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500)
        await browser.close()
    print(f"Discovered {len(seen)} URLs → Redis")


if __name__ == "__main__":
    asyncio.run(discover_urls())

# Scrapy workers drain REDIS_KEY with scrapy-redis RedisSpider
7. Politeness still matters
OmniScrape solves bot challenges and rotates proxies — it does not grant permission to hammer a site at maximum concurrency. Aggressive crawl rates cause collateral damage to the target and increase your API spend without proportional throughput gains.
Keep CONCURRENT_REQUESTS_PER_DOMAIN at 4–8 and DOWNLOAD_DELAY at 0.5–1.0 seconds for most retail targets. If OmniScrape returns HTTP 429, that is a signal to reduce concurrency, not to retry immediately. Add exponential backoff in the middleware's error handler rather than relying on Scrapy's built-in RetryMiddleware, which does not understand API rate limits.
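The exponential backoff recommended above can be sketched as a small helper. This is one common variant ("full jitter"), chosen here as an assumption; the function name, base delay, and cap are illustrative rather than anything OmniScrape prescribes.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter spreads out retries so workers do not hit the API in lockstep.
    return (rng or random).uniform(0.0, ceiling)
```

With the defaults, attempt 0 waits up to 1 s and attempt 3 up to 8 s. How the delay is applied matters: sleeping inside a synchronous middleware stalls every other request in the process, so rescheduling the request for later is usually the gentler option.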
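Respect robots.txt for discovery phases. Scrapy enforces it through RobotsTxtMiddleware when ROBOTSTXT_OBEY is True; projects generated by scrapy startproject enable it in settings.py, but confirm the setting rather than assuming it is on. OmniScrape fetches are server-side, so they still hit the target, and politeness settings apply equally.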
8. Item pipelines
Pipelines are where you enforce data quality before items reach storage. A price validation pipeline should reject items with empty or non-numeric price fields and push the source URL to a dead-letter queue for manual review or retry. Do not silently drop items — log the rejection with the URL so you can correlate against OmniScrape billing records.
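A minimal sketch of the validation logic described above, kept free of Scrapy imports so it can be unit tested on its own. The function names are hypothetical, and the price parser assumes "." is the decimal separator and "," a thousands separator, which will not hold for every locale.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

_NON_PRICE_CHARS = re.compile(r"[^0-9.,-]")  # currency symbols, letters, spaces


def parse_price(raw: Optional[str]) -> Optional[Decimal]:
    """Return a positive Decimal, or None when the field is empty or non-numeric."""
    if not raw:
        return None
    cleaned = _NON_PRICE_CHARS.sub("", raw).replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return value if value > 0 else None


def validation_errors(item: dict) -> list:
    """List the reasons an item should be rejected; an empty list means valid."""
    errors = []
    if parse_price(item.get("price")) is None:
        errors.append("price missing or non-numeric")
    if not (item.get("sku") or "").strip():
        errors.append("sku missing")
    return errors
```

Inside a pipeline's process_item, a non-empty result would be logged together with the item's URL, pushed to the dead-letter queue, and then turned into scrapy.exceptions.DropItem.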
Attach omniscrape_method and omniscrape_charged from response.meta to every item (as shown in the spider example). A cost accounting pipeline can aggregate these fields by domain and write them to a warehouse table — useful for finance and for identifying which domains consume disproportionate API credits.
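The cost accounting pipeline could be sketched as below. It reads the _url and _charged fields from the spider example; the class name is hypothetical, and it assumes charged is a plain number whose unit (credits or currency) depends on your plan. Writing the totals to a warehouse is left out.

```python
from collections import defaultdict
from decimal import Decimal
from urllib.parse import urlsplit


class CostAccountingPipeline:
    """Sums each item's _charged value per domain."""

    def __init__(self):
        self.totals = defaultdict(Decimal)  # domain -> total charged

    def process_item(self, item, spider):
        charged = item.get("_charged")
        if charged is not None:
            domain = urlsplit(item.get("_url", "")).netloc or "unknown"
            self.totals[domain] += Decimal(str(charged))
        return item  # always pass the item on to later pipelines

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        for domain, total in sorted(self.totals.items()):
            spider.logger.info("OmniScrape cost %s: %s", domain, total)
```

Scrapy pipelines need no base class, so this can be exercised in a test by calling process_item with plain dicts.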
For high-volume pipelines writing to databases, use Scrapy's ITEM_PIPELINES priority ordering to run validation (low number, runs first) before storage (high number, runs last). This avoids writing invalid items to the database even if the storage pipeline has no schema enforcement.
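As a configuration sketch, the ordering described above looks like this in settings.py. The module paths and class names are hypothetical placeholders for your own pipelines.

```python
# settings.py: lower numbers run first (values are conventionally 0-1000)
ITEM_PIPELINES = {
    "myproject.pipelines.PriceValidationPipeline": 100,   # rejects invalid items
    "myproject.pipelines.WarehouseStoragePipeline": 800,  # only sees survivors
}
```

Leaving gaps between the numbers makes it easy to slot in further stages later without renumbering.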
9. Scaling workers
Run multiple Scrapy processes using scrapy-redis: replace the default scheduler and dupefilter with Redis-backed equivalents, and each worker drains the same URL queue while sharing a deduplication set. Workers are stateless and can be added or removed without pausing the crawl.
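The scheduler and dupefilter swap described above comes down to a few settings. This fragment assumes the scrapy-redis package is installed and that Redis runs locally; the URL is a placeholder.

```python
# settings.py: shared by every worker process
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared fingerprint set
SCHEDULER_PERSIST = True  # keep the queue and dupefilter between runs
REDIS_URL = "redis://localhost:6379/0"  # assumption: local Redis instance
```

With SCHEDULER_PERSIST enabled, a restarted worker resumes from the shared queue instead of recrawling.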
OmniScrape is stateless on the API side — you can scale Scrapy workers horizontally until you hit your API plan's concurrency limit or start seeing 429 responses. When 429s appear, reduce CONCURRENT_REQUESTS globally or per-domain rather than adding more workers. More workers with the same concurrency cap just increases queue wait time without improving throughput.
For very large crawls (millions of URLs), partition the URL space by domain or category prefix and assign partitions to dedicated worker groups. This keeps per-domain concurrency predictable and makes it easier to pause or reprioritize specific domains without affecting the rest of the crawl.
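One way to partition by domain is a stable hash of the host, sketched below. The function name and the choice of SHA-256 are assumptions; a static domain-to-group table works just as well when the number of domains is small.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a stable partition index derived from its host."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    host = urlsplit(url).netloc.lower()
    # hashlib is stable across processes, unlike the built-in hash() for strings.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Every URL on the same host lands in the same partition, so a single worker group owns that domain's concurrency budget.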
10. Pre-launch checklist
Run through this list before pushing a new spider to production. Most production incidents with Scrapy + OmniScrape integrations trace back to missing timeout alignment, silent item drops, or missing dead-letter handling.
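Review DOWNLOAD_TIMEOUT in settings and keep it greater than the timeout passed to requests.post in the middleware (120 s) so the two limits never conflict; 130 or higher is a reasonable value. With the synchronous middleware shown earlier, the requests.post timeout is the limit that actually bounds each API call, because Scrapy enforces DOWNLOAD_TIMEOUT in its own download handler.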
- CrawlSpider for sitemap/link discovery; OmniScrape middleware only for PDP and protected pages
- Set omniscrape_mode to js_rendering in meta for known JavaScript-rendered listing pages
- Set enable_solver: True in meta for pages with active bot challenges (Cloudflare, PerimeterX)
- Dead-letter queue for URLs where middleware raises IgnoreRequest — do not silently discard
- Price and SKU validation pipeline rejects empty items before storage pipeline runs
- Export omniscrape_charged per item to warehouse for cost accounting by domain
- DOWNLOAD_TIMEOUT in settings.py set above middleware requests.post timeout (120 s)
- Unit tests for middleware using mocked requests.post with fixture JSON — no live API in CI
- Redis-backed scheduler and dupefilter configured before scaling beyond one worker process
- CONCURRENT_REQUESTS_PER_DOMAIN capped at 4–8; DOWNLOAD_DELAY at 0.5 s minimum
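The settings-related items in this checklist can be checked mechanically, for example in CI. The helper below is a hypothetical sketch; the fallback values are Scrapy's documented defaults (180 s timeout, 8 requests per domain, no delay).

```python
API_TIMEOUT_S = 120  # timeout passed to requests.post in the middleware


def checklist_violations(settings: dict) -> list:
    """Return human-readable problems found in a Scrapy settings mapping."""
    problems = []
    if settings.get("DOWNLOAD_TIMEOUT", 180) <= API_TIMEOUT_S:
        problems.append("DOWNLOAD_TIMEOUT must exceed the API timeout")
    if not 1 <= settings.get("CONCURRENT_REQUESTS_PER_DOMAIN", 8) <= 8:
        problems.append("CONCURRENT_REQUESTS_PER_DOMAIN should be at most 8")
    if settings.get("DOWNLOAD_DELAY", 0) < 0.5:
        problems.append("DOWNLOAD_DELAY should be at least 0.5 s")
    return problems
```

Passing a spider's custom_settings dict to this function and asserting an empty result covers three checklist items in one test.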
Frequently asked questions
Should I replace Scrapy's entire downloader or use the middleware per-spider?
Use the middleware with the meta opt-out flag (omniscrape=False). This lets friendly internal URLs, sitemaps, and robots.txt fetches use Scrapy's stock downloader at full speed, while protected product pages go through OmniScrape. Replacing the entire downloader forces all requests through the API, including ones that don't need it, which increases cost and latency unnecessarily.
How does OmniScrape compare to scrapy-splash for JavaScript rendering?
Splash is a self-hosted Lua-scriptable browser — you own the infrastructure, the proxy rotation, and the bot detection evasion. When a site blocks your Splash instance, you debug TLS fingerprints and headers yourself. OmniScrape is a managed API: challenge solving, proxy rotation, and browser fingerprinting are handled server-side. The tradeoff is cost per request versus operational overhead. For most production crawls, managed is cheaper in engineering time.
Can I use async Scrapy (2.x) with an async HTTP client instead of requests?
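Yes. The middleware example uses synchronous requests for readability, but Scrapy 2.x supports coroutine downloader middleware. Make process_request an async def and replace the requests.post call with an awaited call on a shared httpx.AsyncClient; creating a new client for every request throws away connection pooling. Because httpx runs on asyncio, enable the asyncio reactor by setting TWISTED_REACTOR to twisted.internet.asyncioreactor.AsyncioSelectorReactor. Scrapy detects coroutine middleware methods automatically in 2.x.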
How does css_extractor mode work in the middleware?
When css_selectors is present in request.meta, the middleware sets output_format to css_extractor and passes the selector map to the API. OmniScrape runs the selectors server-side and returns a dict in data.css_extracted. The middleware stashes this dict in response.meta['css_extracted'] and returns a minimal HTML shell. The spider's parse method yields the dict directly — no CSS parsing in the spider, no dependency on exact HTML structure in tests.
What happens when OmniScrape returns success: false?
The middleware logs the failure with the URL and raises IgnoreRequest, which signals Scrapy to drop the request without triggering the retry middleware. Push the URL to a Redis dead-letter set in the middleware's error handler so you can inspect and requeue failed URLs manually. Do not rely on Scrapy's built-in retry for API failures — it will retry with the same parameters and fail again.
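A dead-letter helper for this could be as small as the sketch below. The key names and the function are hypothetical; client is any object exposing redis-py style sadd and hset methods, so a fake can stand in during tests.

```python
def dead_letter(client, url: str, reason: str,
                key: str = "omniscrape:dead_letter") -> None:
    """Record a failed URL for later inspection and requeueing."""
    client.sadd(key, url)                       # set membership dedupes repeats
    client.hset(f"{key}:reasons", url, reason)  # keeps the latest failure reason
```

Calling it just before each raise IgnoreRequest() in the middleware keeps the failure visible without involving Scrapy's retry machinery.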
How do I handle session-based crawls (login required) in Scrapy with OmniScrape?
Set a session identifier per request in request.meta['omniscrape_session'] and have the middleware copy it into the API payload as session_id. The middleware listing in section 4 does not read this key, so add that step before relying on it. OmniScrape will reuse the same browser session for requests sharing a session_id, preserving cookies and local storage across requests. Limit session reuse to the same domain and rotate session IDs periodically to avoid session fingerprinting.
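Session support amounts to copying one more meta key into the request body. The sketch below pulls the payload construction from the section 4 middleware into a standalone function (build_payload is a hypothetical name) and adds the session step, which the original listing does not include.

```python
def build_payload(url: str, meta: dict) -> dict:
    """Build the OmniScrape request body from Scrapy request meta."""
    body = {
        "url": url,
        "mode": meta.get("omniscrape_mode", "auto"),
        "output_format": "html",
    }
    if meta.get("css_selectors"):
        body["output_format"] = "css_extractor"
        body["css_selectors"] = meta["css_selectors"]
    if meta.get("enable_solver"):
        body["enable_solver"] = True
    if meta.get("omniscrape_proxy"):
        body["proxy"] = meta["omniscrape_proxy"]
    # Session reuse: not part of the middleware listing in section 4.
    if meta.get("omniscrape_session"):
        body["session_id"] = meta["omniscrape_session"]
    return body
```

Keeping this as a pure function also makes the payload rules testable without mocking any HTTP call.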
How do I monitor crawl cost in real time?
Attach billing.charged from the API response to each item as a metadata field, as shown in the spider example. A lightweight pipeline aggregates charged values by domain and writes totals to a metrics store (Redis counters work well). Set a Scrapy extension that reads these counters and logs a cost summary in spider_closed. For finance reporting, write the per-item cost rows to a warehouse table alongside the scraped data.