OmniScrape

Scrapy Web Scraping with OmniScrape: Download Middleware, Pipelines, and Scale

Scrapy is the production-grade Python crawler: Twisted-backed concurrency, deduplication, configurable politeness, item pipelines, and a middleware stack you can hook at every layer. For internal or friendly sites it is unbeatable. For anything behind Cloudflare, PerimeterX, or a heavy SPA, the default Twisted HTTP downloader fails the same way a plain requests.get call does — wrong TLS fingerprint, no challenge solver, no JavaScript engine.

The architecture that works: Scrapy owns discovery, queuing, deduplication, and export; OmniScrape owns the actual HTTP fetch for protected or JavaScript-heavy URLs via a custom download middleware (Pattern A). The spider's parse() methods never change — they still receive an HtmlResponse. For the rare case of login flows or infinite-scroll discovery that requires real browser interaction, Pattern B delegates that work outside CrawlSpider entirely. See e-commerce scraping for how this fits into a full catalog pipeline, and web scraping with Python for the non-Scrapy baseline.

On this page

1. When to use Scrapy
2. Where default Scrapy breaks
3. Pattern A architecture
4. Pattern A middleware code
5. Spider using the middleware
6. Pattern B — interactive flows outside CrawlSpider
7. Politeness still matters
8. Item pipelines
9. Scaling workers
10. Pre-launch checklist
11. FAQ

1. When to use Scrapy

Scrapy earns its complexity budget when you have thousands of URLs, need deduplication across runs, want structured item exports to JSON/CSV/Parquet, and have a team already familiar with spiders and pipelines. For fewer than a few hundred URLs or one-off scripts, plain Python with the OmniScrape API is faster to write and easier to debug.

The sweet spot: site-wide crawls where URL discovery, politeness, and export are the hard parts — and OmniScrape handles the protected fetches transparently inside the downloader.

  • Site-wide discovery with CrawlSpider link extraction rules
  • Item pipelines for validation, deduplication, and cleaning
  • Per-domain concurrency caps and download delay
  • Feed exports to S3, GCS, or local Parquet via custom pipeline
  • Shared Redis queue for multi-process horizontal scale
  • Built-in stats collection for monitoring throughput and error rates

2. Where default Scrapy breaks

Scrapy's Twisted downloader sends requests with a stock Python TLS fingerprint and no browser-like header profile. Retail, travel, and SERP sites fingerprint TLS client hellos and block non-browser stacks within seconds. Scrapy has no built-in challenge solver for JavaScript-based bot checks.

JavaScript-heavy listing pages — where products are injected by React or Vue after the initial HTML loads — return an empty shell to Scrapy's downloader. The spider's CSS selectors find nothing, and items come back empty with no obvious error. These failures are silent and expensive to debug at scale.

  • Cloudflare JS challenges stop the default downloader, which cannot execute the challenge script
  • Empty item lists when SPA listings are client-side rendered
  • No native proxy rotation — stock downloader leaks datacenter IPs
  • Spider logic tightly coupled to Scrapy runtime, making fetch logic hard to unit test
  • No retry intelligence for soft 403s that return 200 with a CAPTCHA page body
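
The soft-403 case in the last bullet can be caught with a simple body check. The following is a minimal sketch, not part of OmniScrape or Scrapy: the function name and the marker phrases are assumptions, and the markers should be tuned against block pages you have actually observed.

```python
# Hypothetical marker phrases; tune these against real blocked responses.
CHALLENGE_MARKERS = (
    "captcha",
    "verify you are human",
    "checking your browser",
    "access denied",
)

def looks_like_soft_block(status: int, body: str) -> bool:
    """Return True when a response is probably a block page despite its status."""
    if status in (403, 429):
        return True
    text = body.lower()
    # A 200 response whose body contains challenge wording is a "soft" block.
    return any(marker in text for marker in CHALLENGE_MARKERS)
```

A check like this is a heuristic: a legitimate page that mentions one of the phrases would be flagged too, so log flagged URLs for review before acting on them automatically.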

3. Pattern A architecture

Pattern A keeps Scrapy's architecture intact. A custom download middleware sits above the default HTTP handler in the DOWNLOADER_MIDDLEWARES priority stack. When a request arrives with the omniscrape meta flag set (or by default), the middleware intercepts it, POSTs the URL to the OmniScrape API, and returns an HtmlResponse constructed from the API response. Scrapy's scheduler, deduplicator, and pipelines never know the fetch happened externally.

Spider parse() methods receive the same HtmlResponse they always have. If you use css_extractor output format, the middleware stashes the extracted dict in response.meta['css_extracted'] and returns a minimal HTML body — the spider yields the dict directly without any CSS selector logic. This keeps spider code clean and testable without a live API.

Set meta flags per-request to control mode: omit the flag for fast HTTP-only pages, set omniscrape_mode to 'js_rendering' for known JavaScript listings, and enable_solver for pages with active bot challenges. A URL regex map in the spider's start_requests is a clean way to assign modes without hardcoding them in the middleware.
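
One way to build the URL regex map described above is sketched here. The URL patterns and the helper name meta_for_url are illustrative assumptions; only the meta keys (omniscrape, omniscrape_mode, enable_solver) come from this guide.

```python
import re

# Hypothetical URL patterns; the first match wins, so order from most to least specific.
MODE_RULES = [
    (re.compile(r"/sitemap.*\.xml$"), {"omniscrape": False}),
    (re.compile(r"/category/"), {"omniscrape_mode": "js_rendering"}),
    (re.compile(r"/product/"), {"enable_solver": True}),
]

def meta_for_url(url: str) -> dict:
    """Return the request.meta flags for a URL, or {} for the default mode."""
    for pattern, meta in MODE_RULES:
        if pattern.search(url):
            return dict(meta)  # copy so callers can mutate the result safely
    return {}
```

Inside start_requests, each request would then be created with scrapy.Request(url, meta=meta_for_url(url)), which keeps mode selection in the spider and out of the middleware.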

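4. Pattern A middleware code

Register OmniScrapeMiddleware in DOWNLOADER_MIDDLEWARES at priority 543. Scrapy calls process_request in ascending priority order, so at 543 the middleware runs before RetryMiddleware (550), HttpCompressionMiddleware (590), and RedirectMiddleware (600). Because it returns a response, Scrapy skips the remaining process_request hooks and the stock download handler for that request, so OmniScrape handles the fetch before Scrapy's own HTTP stack touches it.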

The middleware reads mode, enable_solver, and css_selectors from request.meta, so individual requests can opt into js_rendering or solver without changing the middleware itself. Requests with meta omniscrape set to False fall through to the stock downloader — useful for sitemaps or friendly internal APIs that don't need the API.

myproject/middleware.py
python
import os
import json
import requests
from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest

class OmniScrapeMiddleware:
    API = "https://api.omniscrape.io/v1/scrape"
    KEY = os.environ["OMNISCRAPE_KEY"]

    def process_request(self, request, spider):
        # Opt out per-request: meta omniscrape=False uses stock downloader
        if request.meta.get("omniscrape") is False:
            return None

        body = {
            "url": request.url,
            "mode": request.meta.get("omniscrape_mode", "auto"),
            "output_format": "html",
        }

        # Upgrade to css_extractor when selectors are provided
        if selectors := request.meta.get("css_selectors"):
            body["output_format"] = "css_extractor"
            body["css_selectors"] = selectors

        # Enable solver for bot-protected pages
        if request.meta.get("enable_solver"):
            body["enable_solver"] = True

        # Optional residential proxy
        if proxy := request.meta.get("omniscrape_proxy"):
            body["proxy"] = proxy

        try:
            r = requests.post(
                self.API,
                headers={
                    "X-API-Key": self.KEY,
                    "Content-Type": "application/json",
                },
                json=body,
                timeout=120,
            )
            r.raise_for_status()
        except requests.RequestException as exc:
            spider.logger.error("OmniScrape request error %s: %s", request.url, exc)
            raise IgnoreRequest()

        data = r.json()

        if not data.get("success"):
            spider.logger.error(
                "OmniScrape failure %s — response: %s", request.url, data
            )
            raise IgnoreRequest()

        # Attach billing and method metadata for pipeline cost accounting
        request.meta["omniscrape_method"] = (
            data.get("metadata", {}).get("method_used")
        )
        request.meta["omniscrape_charged"] = (
            data.get("billing", {}).get("charged")
        )

        if css := data["data"].get("css_extracted"):
            # css_extractor mode — stash dict, return empty shell
            request.meta["css_extracted"] = css
            content = "<html></html>"
        else:
            # html mode — full page content
            content = data["data"]["content"]

        return HtmlResponse(
            url=request.url,
            body=content.encode("utf-8"),
            encoding="utf-8",
            request=request,
        )

5. Spider using the middleware

The spider sets css_selectors in meta for product detail pages. parse_product checks for css_extracted first — if present, it yields the dict directly. The fallback CSS path handles any URL that slipped through without the extractor (for example, pages where the middleware fell back to html mode due to a selector mismatch).

custom_settings on the spider class keeps middleware registration close to the spider that needs it, rather than in settings.py — useful when only one spider in the project uses OmniScrape.

myproject/spiders/products.py
python
import scrapy

PRODUCT_SELECTORS = {
    "title": "h1",
    "price": "[data-price]",
    "sku": "[data-sku]",
    "availability": ".stock-status",
    "image_url": "img.product-hero::attr(src)",
}

class ProductSpider(scrapy.Spider):
    name = "products"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middleware.OmniScrapeMiddleware": 543,
            # Disable response decompression: the middleware returns already-decoded content
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": None,
        },
        "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
        "DOWNLOAD_DELAY": 0.5,
        "DOWNLOAD_TIMEOUT": 130,  # slightly above OmniScrape timeout
        "RETRY_ENABLED": False,   # failed fetches are dropped via IgnoreRequest, not retried
    }

    def __init__(self, product_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.product_urls = product_urls or []

    def start_requests(self):
        for url in self.product_urls:
            yield scrapy.Request(
                url,
                meta={
                    "css_selectors": PRODUCT_SELECTORS,
                    "enable_solver": True,
                    "omniscrape_proxy": "residential:us",
                },
                callback=self.parse_product,
                errback=self.handle_error,
            )

    def parse_product(self, response):
        if extracted := response.meta.get("css_extracted"):
            yield {
                **extracted,
                "_url": response.url,
                "_method": response.meta.get("omniscrape_method"),
                "_charged": response.meta.get("omniscrape_charged"),
            }
        else:
            # Fallback: css_extractor was not used or returned nothing
            yield {
                "title": response.css("h1::text").get("").strip(),
                "price": response.css("[data-price]::text").get("").strip(),
                "sku": response.css("[data-sku]::text").get("").strip(),
                "_url": response.url,
                "_method": response.meta.get("omniscrape_method"),
                "_charged": response.meta.get("omniscrape_charged"),
            }

    def handle_error(self, failure):
        self.logger.error("Failed: %s — %s", failure.request.url, failure.value)

6. Pattern B — interactive flows outside CrawlSpider

Scrapy is a poor fit for flows that require real browser interaction: login forms with CSRF tokens, infinite scroll discovery, or multi-step checkout funnels. Forcing CrawlSpider to handle these produces brittle code that fights the framework.

Pattern B keeps Scrapy for what it does well — queuing, deduplication, export — and delegates interactive discovery to a separate process. That process uses Playwright (or a BaaS endpoint) to drive a real browser, extracts discovered product URLs, and pushes them into the shared Redis queue that Scrapy workers drain. The two processes are decoupled: the browser process can restart independently, and Scrapy's deduplication filter prevents double-processing URLs that appear in both discovery and sitemap paths.

Use Pattern B only when you genuinely need click or scroll interactions for discovery. Most retail crawls can avoid it entirely by combining sitemap parsing (no OmniScrape needed) with OmniScrape-fetched PDPs.

discovery/infinite_scroll.py
python
# discovery/infinite_scroll.py — runs independently of Scrapy
# Uses Playwright to scroll a category page, collects product URLs,
# and pushes them to the shared Redis queue for Scrapy workers.

import asyncio
import redis
from playwright.async_api import async_playwright

REDIS_KEY = "scrapy:products:start_urls"
CATEGORY_URL = "https://example.com/category/shoes"

async def discover_urls():
    r = redis.Redis(host="localhost", decode_responses=True)
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(CATEGORY_URL)

        seen = set()
        for _ in range(20):  # scroll up to 20 times
            links = await page.eval_on_selector_all(
                "a.product-card", "els => els.map(e => e.href)"
            )
            new_links = [l for l in links if l not in seen]
            if not new_links:
                break
            for link in new_links:
                r.rpush(REDIS_KEY, link)
                seen.add(link)
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500)

        await browser.close()
    print(f"Discovered {len(seen)} URLs → Redis")

asyncio.run(discover_urls())
# Scrapy workers drain REDIS_KEY with scrapy-redis RedisSpider
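
For crawls that can skip Pattern B, the sitemap route mentioned above needs no browser at all. The sketch below extracts URLs from sitemap XML with the standard library; the function name is an assumption, and Scrapy's own SitemapSpider covers the same ground if you prefer to stay inside the framework.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

The returned URLs can be pushed to the same Redis list the Playwright script writes to, so both discovery paths feed one queue.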

7. Politeness still matters

OmniScrape solves bot challenges and rotates proxies — it does not grant permission to hammer a site at maximum concurrency. Aggressive crawl rates cause collateral damage to the target and increase your API spend without proportional throughput gains.

Keep CONCURRENT_REQUESTS_PER_DOMAIN at 4–8 and DOWNLOAD_DELAY at 0.5–1.0 seconds for most retail targets. If OmniScrape returns HTTP 429, that is a signal to reduce concurrency, not to retry immediately. Add exponential backoff in the middleware's error handler rather than relying on Scrapy's built-in RetryMiddleware, which does not understand API rate limits.
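
The backoff described above could look roughly like the sketch below. The helper name, its parameters, and the delay values are assumptions rather than OmniScrape recommendations; post is any zero-argument callable that performs the API call, for example a functools.partial around requests.post.

```python
import random
import time

def post_with_backoff(post, max_attempts=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call post() until it returns a non-429 response or attempts run out.

    post must return an object with a status_code attribute.
    The last response is returned even if it is still a 429.
    """
    for attempt in range(max_attempts):
        response = post()
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break
        # Exponential growth capped at max_delay, with full jitter so that
        # parallel workers do not retry in lockstep.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, delay))
    return response
```

Note that time.sleep blocks the process just as the synchronous requests call does, so this sketch fits the synchronous middleware shown earlier; an async middleware would await a non-blocking sleep instead.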

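Respect robots.txt for discovery phases. Projects generated by scrapy startproject set ROBOTSTXT_OBEY = True, but Scrapy's global default is False, so verify the setting in your project. OmniScrape fetches are server-side, so they still hit the target — politeness settings apply equally.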

8. Item pipelines

Pipelines are where you enforce data quality before items reach storage. A price validation pipeline should reject items with empty or non-numeric price fields and push the source URL to a dead-letter queue for manual review or retry. Do not silently drop items — log the rejection with the URL so you can correlate against OmniScrape billing records.
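
A price validation pipeline along these lines might be sketched as follows. The class name, the parse_price helper, and the in-memory dead_letters list are illustrative assumptions; a real deployment would push to Redis or another durable queue. The parser assumes US-style prices such as "$1,299.00".

```python
import re
from decimal import Decimal, InvalidOperation

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

def parse_price(raw):
    """Return a Decimal for strings like '$1,299.00', or None if unparseable."""
    if not raw:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))  # keep digits and the decimal point
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

class PriceValidationPipeline:
    def __init__(self):
        self.dead_letters = []  # stand-in for a real dead-letter queue

    def process_item(self, item, spider):
        price = parse_price(item.get("price"))
        if price is None:
            url = item.get("_url")
            self.dead_letters.append(url)
            # Log the URL so the rejection can be matched to billing records.
            spider.logger.warning("Rejected item with bad price: %s", url)
            raise DropItem(f"invalid price for {url}")
        item["price"] = price
        return item
```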

Attach omniscrape_method and omniscrape_charged from response.meta to every item (as shown in the spider example). A cost accounting pipeline can aggregate these fields by domain and write them to a warehouse table — useful for finance and for identifying which domains consume disproportionate API credits.
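
The cost accounting idea can be sketched with in-memory counters, as below. The class name is an assumption, and the sketch assumes billing.charged is numeric; it logs totals at shutdown rather than writing to a warehouse, which you would swap in for production.

```python
from collections import defaultdict
from urllib.parse import urlparse

class CostAccountingPipeline:
    """Sum the _charged field per domain and log the totals when the spider closes."""

    def __init__(self):
        self.charged_by_domain = defaultdict(float)
        self.items_by_domain = defaultdict(int)

    def process_item(self, item, spider):
        domain = urlparse(item.get("_url", "")).netloc or "unknown"
        # _charged may be None when the API response carried no billing block.
        self.charged_by_domain[domain] += float(item.get("_charged") or 0)
        self.items_by_domain[domain] += 1
        return item

    def close_spider(self, spider):
        for domain, total in sorted(self.charged_by_domain.items()):
            spider.logger.info(
                "cost domain=%s items=%d charged=%.4f",
                domain, self.items_by_domain[domain], total,
            )
```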

For high-volume pipelines writing to databases, use Scrapy's ITEM_PIPELINES priority ordering to run validation (low number, runs first) before storage (high number, runs last). This avoids writing invalid items to the database even if the storage pipeline has no schema enforcement.
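
As a settings fragment, that ordering might look like this. The pipeline class paths are hypothetical placeholders for whatever your project defines.

```python
# settings.py fragment; the pipeline class paths are hypothetical.
ITEM_PIPELINES = {
    "myproject.pipelines.PriceValidationPipeline": 100,   # lowest number, runs first
    "myproject.pipelines.CostAccountingPipeline": 200,
    "myproject.pipelines.WarehouseStoragePipeline": 900,  # highest number, runs last
}
```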

9. Scaling workers

Run multiple Scrapy processes using scrapy-redis: replace the default scheduler and dupefilter with Redis-backed equivalents, and each worker drains the same URL queue while sharing a deduplication set. Workers are stateless and can be added or removed without pausing the crawl.
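
A minimal scrapy-redis configuration for this setup is sketched below. The setting names are scrapy-redis's own; the Redis URL is an assumed local instance and should point at your shared server.

```python
# settings.py fragment for scrapy-redis; REDIS_URL is an assumed local instance.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the queue and dedup set between runs
REDIS_URL = "redis://localhost:6379/0"
```

Every worker process loads the same settings, which is what lets them share one queue and one deduplication set.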

OmniScrape is stateless on the API side — you can scale Scrapy workers horizontally until you hit your API plan's concurrency limit or start seeing 429 responses. When 429s appear, reduce CONCURRENT_REQUESTS globally or per-domain rather than adding more workers. More workers with the same concurrency cap just increases queue wait time without improving throughput.

For very large crawls (millions of URLs), partition the URL space by domain or category prefix and assign partitions to dedicated worker groups. This keeps per-domain concurrency predictable and makes it easier to pause or reprioritize specific domains without affecting the rest of the crawl.
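
Partitioning by domain can be done with a stable hash, as in this sketch. The function names and the Redis key layout are assumptions for illustration; any deterministic mapping from domain to worker group works.

```python
import hashlib
from urllib.parse import urlparse

def partition_for(url: str, num_partitions: int) -> int:
    """Map a URL to a stable partition index based on its domain."""
    domain = urlparse(url).netloc.lower()
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def queue_key(url: str, num_partitions: int, prefix: str = "scrapy:products") -> str:
    """Hypothetical Redis key layout: one start_urls list per partition."""
    return f"{prefix}:p{partition_for(url, num_partitions)}:start_urls"
```

Because every URL from a given domain lands in the same partition, the worker group assigned to that partition is the only one hitting the domain, which keeps per-domain concurrency predictable.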

10. Pre-launch checklist

Run through this list before pushing a new spider to production. Most production incidents with Scrapy + OmniScrape integrations trace back to missing timeout alignment, silent item drops, or missing dead-letter handling.

Review DOWNLOAD_TIMEOUT in settings — it must be greater than the timeout passed to requests.post in the middleware (120 s), otherwise Scrapy will cancel the request before OmniScrape responds. Set DOWNLOAD_TIMEOUT to 130 or higher.

  • CrawlSpider for sitemap/link discovery; OmniScrape middleware only for PDP and protected pages
  • Set omniscrape_mode to js_rendering in meta for known JavaScript-rendered listing pages
  • Set enable_solver: True in meta for pages with active bot challenges (Cloudflare, PerimeterX)
  • Dead-letter queue for URLs where middleware raises IgnoreRequest — do not silently discard
  • Price and SKU validation pipeline rejects empty items before storage pipeline runs
  • Export omniscrape_charged per item to warehouse for cost accounting by domain
  • DOWNLOAD_TIMEOUT in settings.py set above middleware requests.post timeout (120 s)
  • Unit tests for middleware using mocked requests.post with fixture JSON — no live API in CI
  • Redis-backed scheduler and dupefilter configured before scaling beyond one worker process
  • CONCURRENT_REQUESTS_PER_DOMAIN capped at 4–8; DOWNLOAD_DELAY at 0.5 s minimum

Frequently asked questions

Should I replace Scrapy's entire downloader or use the middleware per-spider?

Use the middleware with the meta opt-out flag (omniscrape=False). This lets friendly internal URLs, sitemaps, and robots.txt fetches use Scrapy's stock downloader at full speed, while protected product pages go through OmniScrape. Replacing the entire downloader forces all requests through the API, including ones that don't need it, which increases cost and latency unnecessarily.

How does OmniScrape compare to scrapy-splash for JavaScript rendering?

Splash is a self-hosted Lua-scriptable browser — you own the infrastructure, the proxy rotation, and the bot detection evasion. When a site blocks your Splash instance, you debug TLS fingerprints and headers yourself. OmniScrape is a managed API: challenge solving, proxy rotation, and browser fingerprinting are handled server-side. The tradeoff is cost per request versus operational overhead. For most production crawls, managed is cheaper in engineering time.

Can I use async Scrapy (2.x) with an async HTTP client instead of requests?

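Yes. The middleware example uses synchronous requests for readability, but a blocking call stalls Scrapy's event loop for the duration of each fetch. Scrapy 2.x supports coroutine download middleware: define process_request with async def and replace the requests.post call with an awaited httpx.AsyncClient post. Because httpx is asyncio-based, enable the asyncio reactor by setting TWISTED_REACTOR to twisted.internet.asyncioreactor.AsyncioSelectorReactor. Scrapy detects coroutine methods automatically, so no separate registration is needed.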

How does css_extractor mode work in the middleware?

When css_selectors is present in request.meta, the middleware sets output_format to css_extractor and passes the selector map to the API. OmniScrape runs the selectors server-side and returns a dict in data.css_extracted. The middleware stashes this dict in response.meta['css_extracted'] and returns a minimal HTML shell. The spider's parse method yields the dict directly — no CSS parsing in the spider, no dependency on exact HTML structure in tests.

What happens when OmniScrape returns success: false?

The middleware logs the failure with the URL and raises IgnoreRequest, which signals Scrapy to drop the request without triggering the retry middleware. Push the URL to a Redis dead-letter set in the middleware's error handler so you can inspect and requeue failed URLs manually. Do not rely on Scrapy's built-in retry for API failures — it will retry with the same parameters and fail again.
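
A dead-letter helper for the middleware could be as small as the sketch below. The key names and function name are assumptions; client is any object exposing redis-py style sadd and hset methods, so the helper can be tested without a live Redis.

```python
DEAD_LETTER_KEY = "omniscrape:dead_letter"  # hypothetical key name

def record_dead_letter(client, url, reason, key=DEAD_LETTER_KEY):
    """Record a failed URL in a Redis set and keep the latest failure reason.

    The set deduplicates URLs; the companion hash stores one reason per URL.
    """
    client.sadd(key, url)
    client.hset(f"{key}:reasons", url, str(reason))
```

The middleware would call this just before raising IgnoreRequest, passing the exception or the API error payload as the reason.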

How do I handle session-based crawls (login required) in Scrapy with OmniScrape?

Pass a session ID per request via request.meta['omniscrape_session'] and have the middleware forward it as session_id in the API payload. The middleware shown earlier does not read this key, so add that step when you need sessions. OmniScrape will reuse the same browser session for requests sharing a session_id, preserving cookies and local storage across requests. Limit session reuse to the same domain and rotate session IDs periodically to avoid session fingerprinting.
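
Forwarding the session ID takes one small step in the payload construction. This sketch assumes the API field is named session_id, as the answer above states; the helper name is hypothetical.

```python
def apply_session(body: dict, meta: dict) -> dict:
    """Copy the session ID from request.meta into the API payload, if present."""
    session_id = meta.get("omniscrape_session")
    if session_id:
        body["session_id"] = str(session_id)
    return body
```

In the middleware, this would be called as apply_session(body, request.meta) before the POST, next to the existing proxy and solver checks.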

How do I monitor crawl cost in real time?

Attach billing.charged from the API response to each item as a metadata field, as shown in the spider example. A lightweight pipeline aggregates charged values by domain and writes totals to a metrics store (Redis counters work well). Set a Scrapy extension that reads these counters and logs a cost summary in spider_closed. For finance reporting, write the per-item cost rows to a warehouse table alongside the scraped data.

Related guides

  • Web Scraping with Python
  • E-commerce Web Scraping: Catalog Intelligence at Production Scale
  • OmniScrape vs Apify
