Web Scraping vs Web Crawling: Architecture, Patterns, and When to Use Each

1. Two distinct jobs, overlapping tools

Web scraping turns a known URL into structured fields — title, price, rating, inventory status, HTML body. The input is a URL you already decided to fetch; the output is a record in your database. The problem is extraction: parsing, field validation, handling schema drift, and dealing with bot protection on that specific page.

Web crawling discovers URLs — following links from a seed set, parsing sitemap XML, respecting robots.txt directives, deduplicating canonical paths, and maintaining a frontier queue that tracks what has been visited and what remains. The primary output of a crawler is a list of URLs (and perhaps lightweight metadata like last-modified headers), not fully populated product records.

Colly, Scrapy CrawlSpider, Apache Nutch, and custom BFS workers are crawlers. A script that reads urls.csv and POSTs each row to an extraction API is a scraper. Many production pipelines are crawl-then-scrape: the crawler discovers PDP URLs from category pages and sitemaps, writes them to a queue, and a separate scrape worker enriches each one with structured data. Conflating the two leads to over-engineered crawlers doing unnecessary work, or under-engineered scrapers that break the moment a site adds a new category.

A useful mental model: crawling answers 'what exists on this site?', scraping answers 'what does this specific page contain?'. You need to know which question you are actually trying to answer before writing a line of code.

2. When scraping alone is sufficient

You have a URL list from a partner feed, an XML sitemap export, an internal database of known product identifiers, or a government data portal with stable URL patterns. The URL discovery problem is already solved — someone else did it or the structure is predictable enough to generate URLs programmatically.

Common scrape-only scenarios: monitoring a fixed set of competitor product pages for price changes, archiving specific government statutes or regulatory PDFs updated on a known schedule, extracting structured data from a set of news articles whose URLs come from an RSS feed, or enriching an internal catalog with third-party detail pages.

Scraping a focused list of high-value URLs with good session hygiene and appropriate proxy selection almost always costs less — in time, money, and operational complexity — than crawling an entire site poorly. If you need data from ten URLs, do not build a system that visits ten thousand to find them. Crawl infrastructure is not free: you pay in proxy bandwidth, compute, and the engineering overhead of maintaining a frontier queue, dedup store, and politeness scheduler.

3. When you genuinely need a crawler first

No URL list exists and you cannot construct one: you are mapping a new marketplace category structure with thousands of dynamically generated subcategories, or auditing a large content site where the full URL space is unknown. Sitemaps are absent, incomplete, or months out of date. The site has no official API and search engine indexes do not surface the pages you need.

Crawlers also make sense for continuous discovery — monitoring a site for newly added product listings, detecting when category hierarchies change, or building a link graph for SEO analysis. These are inherently open-ended problems where the URL space evolves and you need a system that tracks changes over time.

The critical risk with crawlers is that mistakes multiply. One overly aggressive politeness setting, one missing deduplication check, or one unguarded infinite pagination loop is repeated across every URL in your frontier. A scraper that fails on one URL costs you one record; a crawler that triggers a rate limit on a category page can get your entire IP range blocked before your scrape workers have fetched a single PDP. Separate your crawl discovery rate from your detail fetch rate: crawl category pages conservatively, and scrape PDPs through OmniScrape with higher concurrency only where site policy and your testing confirm it is safe.

4. A production crawl-then-scrape architecture

Stage 1 — Discovery worker: BFS from category seed URLs, extracts hrefs matching your target URL pattern (e.g. /product/\d+ or /p/[a-z0-9-]+), writes URL + discovered_at + discovery_source to Postgres or a Redis sorted set. Respects robots.txt and enforces per-domain crawl delay. Runs at low concurrency — its job is URL collection, not speed.

Stage 2 — Deduplication and canonicalization: unique constraint on canonical URL (strip UTM params, normalize trailing slashes, resolve redirects). Store discovery source — sitemap, pagination link, or inline href — so you can later audit which discovery paths are most efficient and trim expensive ones. Log robots.txt disallow decisions for compliance.

Stage 3 — Scrape worker: pulls URLs from the queue (Postgres SKIP LOCKED or a Redis list), POSTs each to OmniScrape with css_extractor output, validates that required fields are non-empty, writes to your data warehouse, and marks the URL as processed with a timestamp. This worker is stateless and horizontally scalable — add instances to increase throughput without touching the discovery layer.

Stage 4 — Refresh scheduler: re-enqueues URLs on a cadence based on their content type. Price-sensitive PDPs might refresh every 6 hours; static content pages weekly. This is a separate concern from initial discovery and should not be mixed into the crawler logic.
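As a rough illustration of the Stage 1 discovery worker, the sketch below runs a bounded breadth-first crawl from one seed. It is a minimal sketch under stated assumptions, not a production crawler: `fetch_html` is a hypothetical callable you supply (for example a wrapper around your HTTP client that enforces the crawl delay and robots.txt check), and the `/product/\d+` pattern, `max_depth`, and `max_pages` values are placeholders to adapt per site.

```python
import re
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urldefrag, urljoin, urlparse

PDP_PATTERN = re.compile(r"/product/\d+")  # assumed target URL pattern


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover(seed: str, fetch_html: Callable[[str], str],
             max_depth: int = 3, max_pages: int = 1000) -> Dict[str, str]:
    """Bounded BFS from one seed. Returns {pdp_url: page_it_was_found_on}."""
    host = urlparse(seed).netloc
    frontier = deque([(seed, 0)])
    enqueued = {seed}
    found: Dict[str, str] = {}
    fetched = 0
    while frontier and fetched < max_pages:  # hard cap on pages fetched per seed
        url, depth = frontier.popleft()
        fetched += 1
        parser = LinkParser()
        parser.feed(fetch_html(url))
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no fragment
            if urlparse(link).netloc != host:
                continue  # stay on the seed's domain
            if PDP_PATTERN.search(urlparse(link).path):
                found.setdefault(link, url)  # record the PDP, do not crawl into it
            elif link not in enqueued and depth < max_depth:
                enqueued.add(link)
                frontier.append((link, depth + 1))
    return found
```

The returned mapping keeps the page each PDP was first seen on, which is the kind of discovery_source value Stage 1 writes alongside the URL. In a real worker the `found` dictionary would be replaced by inserts into the Postgres table or Redis sorted set.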

scrape_worker.py

```python
# Scrape stage only: URLs were already discovered by the crawler.
import os

import requests

API = "https://api.omniscrape.io/v1/scrape"
KEY = os.environ["OMNISCRAPE_KEY"]  # raises KeyError at startup if the key is missing


def scrape_pdp(url: str) -> dict:
    """Fetch one product detail page and return its CSS-extracted fields."""
    r = requests.post(
        API,
        headers={"X-API-Key": KEY},
        json={
            "url": url,
            "mode": "auto",
            "output_format": "css_extractor",
            "enable_solver": True,
            "proxy": "residential",
            "css_selectors": {
                "title": "h1",
                "price": "[itemprop='price']",
                "sku": "[data-sku]",
                "availability": "[itemprop='availability']",
                "description": ".product-description",
            },
        },
        timeout=120,
    )
    r.raise_for_status()  # non-2xx status from the API itself
    body = r.json()
    if not body.get("success"):
        # Failure reported inside an otherwise successful HTTP response
        raise ValueError(f"OmniScrape error for {url}: {body}")
    extracted = (body.get("data") or {}).get("css_extracted", {})
    method = body.get("metadata", {}).get("method_used", "unknown")
    print(f"[{method}] {url} -> {list(extracted.keys())}")
    return extracted
```
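
Stage 3 also calls for validating that required fields are non-empty before a record reaches the warehouse. One way to sketch that check, assuming `title`, `price`, and `sku` are the required fields (an illustrative choice) and that an unmatched selector shows up as a missing key, `None`, or a blank string:

```python
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("title", "price", "sku")  # assumed required set; adjust per schema


def missing_fields(extracted: Dict[str, object],
                   required: Iterable[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required field names that are absent, None, or blank."""
    missing = []
    for name in required:
        value = extracted.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing
```

A worker could call `missing_fields(scrape_pdp(url))` and send the URL to a retry or review queue whenever the list is non-empty, rather than writing a partial record.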

5. Politeness rules differ by pipeline stage

Crawlers must be conservative on link discovery. During a crawl you touch many URLs you may never actually scrape: category pages, pagination, faceted navigation. One request every 1–3 seconds per domain on category pages is a reasonable starting point. Check robots.txt Crawl-delay directives and honor them. Log every 429 and 503 response and back off exponentially; do not retry immediately.
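
The backoff rule above might look like the following. This is a sketch with assumed defaults (`base` of 2 seconds, `cap` of 300 seconds, at most 5 attempts); it honors a Retry-After header only when the value is a plain number of seconds and falls back to exponential backoff otherwise.

```python
import random
from typing import Optional

RETRYABLE = {429, 503}


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 2.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    A numeric Retry-After value wins. Otherwise the delay grows exponentially
    and is jittered between half and all of the current ceiling, so the retry
    is never immediate and workers do not retry in lockstep.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return ceiling / 2 + random.uniform(0, ceiling / 2)


def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Retry only throttling or overload responses, a bounded number of times."""
    return status in RETRYABLE and attempt < max_attempts
```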

Scrape workers can run higher per-domain concurrency on PDP URLs if your use case and the site's terms allow it — each request has direct data value. But 'higher' is relative: even 5 concurrent requests to the same domain can trigger WAF rules on sensitive endpoints. Test with 2 concurrent, measure error rates, and scale up incrementally. Cap per-domain parallelism and use per-domain rate limit state shared across worker instances.

Never share a single politeness budget between your crawler and scraper. They hit different URL patterns, often different subdomains or CDN origins, and should be governed independently. A 429 on /category/shoes should not pause your scraper fetching /product/12345 — these are different rate limit buckets from the server's perspective.
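
One way to keep the two budgets independent is to key rate-limit state by stage as well as domain. The class below is an illustrative sketch: the name `StageRateLimiter`, the stage labels, and the intervals are assumptions, and the state lives in a local dictionary, so a multi-instance deployment would need to move it to a shared store such as Redis.

```python
import time
from typing import Callable, Dict, Tuple


class StageRateLimiter:
    """Minimum-interval limiter keyed by (stage, domain).

    Keeping the stage in the key gives the crawler and the scraper
    separate budgets for the same domain.
    """

    def __init__(self, intervals: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.intervals = intervals  # seconds between requests, per stage
        self.clock = clock
        self.next_allowed: Dict[Tuple[str, str], float] = {}

    def wait_time(self, stage: str, domain: str) -> float:
        """Reserve the next slot and return how long the caller should sleep."""
        now = self.clock()
        key = (stage, domain)
        start = max(now, self.next_allowed.get(key, now))
        self.next_allowed[key] = start + self.intervals[stage]
        return start - now

    def penalize(self, stage: str, domain: str, seconds: float) -> None:
        """Push back one stage's budget (e.g. after a 429) without touching the other."""
        key = (stage, domain)
        current = max(self.next_allowed.get(key, 0.0), self.clock())
        self.next_allowed[key] = current + seconds
```

With this layout, a 429 on a category page calls `penalize("crawl", domain, ...)` and the scrape stage keeps its own schedule for the same domain.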

For sites protected by AWS WAF or similar, route both stages through OmniScrape with enable_solver: true. The solver handles challenge pages without you needing to implement fingerprint rotation or cookie replay logic in your own workers.

6. Always check sitemaps and feeds before writing a recursive crawler

Before you write a single CrawlSpider rule, fetch /sitemap.xml, /sitemap_index.xml, /robots.txt (which often references sitemap locations), and any documented product feeds or APIs. Retailers frequently expose 80–95% of their PDP URLs in XML sitemaps while their category HTML is behind aggressive bot protection. Parsing a sitemap is an afternoon of work; building a robust recursive crawler that handles JavaScript-rendered pagination is weeks.
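
To give a sense of how little code sitemap parsing needs, here is a minimal sketch using only the standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema and that you pass the raw response bytes; the function name `parse_sitemap` is illustrative, and gzip-compressed sitemaps would need decompressing first.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes: bytes) -> Tuple[str, List[str]]:
    """Return ("index", child_sitemap_urls) or ("urlset", page_urls)."""
    root = ET.fromstring(xml_bytes)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    # <loc> holds the URL in both sitemap index files and URL sets
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs
```

When the result is an index, fetch each child sitemap and parse it the same way; when it is a URL set, filter the URLs against your PDP pattern and enqueue them.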

Official data feeds — Google Merchant Center exports, affiliate network feeds, supplier EDI files — often contain structured product data directly, making the scrape stage unnecessary for fields already in the feed. Use the feed for bulk data and scrape only for fields the feed omits (real-time price, inventory, review counts).

When you do crawl, instrument why each URL was enqueued: sitemap entry, pagination link, inline product href, or search result. This metadata lets you analyze which discovery paths are most efficient and prune expensive ones. If 90% of your PDPs come from sitemaps and 10% from crawling 500 category pages, you can reduce crawl scope dramatically once you have that data.
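
The source analysis described above can be a few lines once the discovery source is stored with each URL. The sketch below assumes rows of `(canonical_url, discovery_source)` in discovery order and credits a URL found several ways to the first source that saw it; both the row shape and that attribution rule are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def source_share(rows: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Share of unique URLs attributed to each discovery source."""
    first_source: Dict[str, str] = {}
    for url, source in rows:
        first_source.setdefault(url, source)  # first discovery wins
    counts = Counter(first_source.values())
    total = sum(counts.values())
    return {src: n / total for src, n in counts.items()} if total else {}
```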

Crawl only what feeds and sitemaps miss. Your frontier should shrink over time as you identify reliable URL sources, not grow unbounded.

7. Scrapy: CrawlSpider discovers, middleware delegates extraction

A common Scrapy production pattern uses CrawlSpider rules to discover product links from category and pagination pages, then a custom downloader middleware intercepts each request whose URL matches a PDP pattern and POSTs it to OmniScrape instead of fetching it directly. The middleware returns a Scrapy HtmlResponse object constructed from the HTML in OmniScrape's response, so your item parsers receive normal HTML and require no changes.

This separation is clean: Scrapy handles scheduling, deduplication, item pipelines, and feed exports; OmniScrape handles IP rotation, challenge solving, and browser rendering for protected PDPs. You get the orchestration strengths of Scrapy without maintaining your own proxy pool or browser fleet.

Keep DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED active in Scrapy settings even when OmniScrape handles IP rotation. You are still generating load on the target server — OmniScrape changes the apparent source IP, not the request volume. Scrapy's autothrottle also protects you from overwhelming your own OmniScrape concurrency quota.
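
A settings fragment along these lines keeps throttling active. The setting names are standard Scrapy settings; the values are illustrative starting points rather than recommendations, and should be tuned against the target site's behavior and your own quota.

```python
# settings.py (fragment); values are assumptions to tune per site
ROBOTSTXT_OBEY = True                  # let Scrapy filter requests disallowed by robots.txt
DOWNLOAD_DELAY = 2.0                   # base delay in seconds between requests to one site
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # hard cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True            # adjust the delay from observed response latency
AUTOTHROTTLE_START_DELAY = 2.0         # initial delay before latency data exists
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay when the site slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests AutoThrottle aims for
```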

For the discovery stage (category pages), you may not need OmniScrape at all if those pages are lightly protected. Fetch them directly with Scrapy's default downloader and reserve OmniScrape calls for PDP URLs where protection is heaviest. This keeps your API usage — and cost — proportional to actual extraction value.

8. Crawl-specific pitfalls that sink production pipelines

Infinite pagination: calendar archives (/archive/2020/01, /archive/2020/02, ...), faceted navigation with unbounded filter combinations (/shoes?color=red&size=10&brand=x), and session-parameterized URLs that generate unique paths for every visitor. Always set a maximum crawl depth and URL count per seed, and review frontier growth rate daily during initial development.

Missing deduplication: paying twice for the same PDP when it is discovered via two category paths, or re-scraping the same URL on every crawler run because you forgot to persist the visited set. Use canonical URL normalization (strip tracking params, normalize protocol and trailing slash) before inserting into your dedup store.
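
A canonicalization step of this kind can be sketched with the standard library. The assumptions here are that the site serves the same page over http and https and with or without a trailing slash, and that `utm_*`, `gclid`, and `fbclid` are the tracking parameters to drop; verify each against the real site before relying on it. Redirect resolution needs a network request and is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}  # assumed list; extend per site


def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent forms map to one dedup key."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES)
        and k.lower() not in TRACKING_PARAMS
    ]
    query.sort()  # parameter order should not create a new key
    return urlunsplit((scheme, host, path, urlencode(query), ""))  # fragment dropped
```

Use the returned string as the value under the unique constraint, and keep the originally discovered URL in a separate column for debugging.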

Treating crawl and scrape block rates as a single metric. Category pages might be completely open while PDPs are behind Akamai Bot Manager. A spike in discovery 403s is a crawler problem; a spike in extraction failures is a scraper problem. Instrument them separately with distinct metric labels.

Crawling sections of a site your compliance or legal review has not approved. Archive robots.txt decisions per domain in a compliance log with timestamps. If a site updates its robots.txt to disallow a path you were crawling, you need an audit trail showing when you stopped.
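
A compliance log entry can be produced at the same moment the robots.txt decision is made. The sketch below uses the standard library's `urllib.robotparser`; the user agent token `example-crawler`, the log record shape, and the in-memory list standing in for a durable log are all assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, List
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent token


def check_and_log(url: str, robots_txt: str, log: List[Dict[str, object]]) -> bool:
    """Evaluate one URL against a robots.txt body and append the decision to `log`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(USER_AGENT, url)
    parts = urlsplit(url)
    log.append({
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "domain": parts.netloc,
        "path": parts.path,
        "allowed": allowed,
        "crawl_delay": parser.crawl_delay(USER_AGENT),  # None if not specified
    })
    return allowed
```

Storing the robots.txt body (or a hash of it) next to each entry would additionally show which version of the file a decision was based on.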

Storing raw HTML from the crawl stage and re-parsing it weeks later. HTML schemas drift. Extract what you need at scrape time and store structured records, not raw markup, as your source of truth.

9. Where OmniScrape fits in a crawl-then-scrape pipeline

OmniScrape is a stateless per-URL fetch and extraction service. It is purpose-built for the scrape stage of a crawl-then-scrape pipeline — parallel detail fetches after URL discovery is complete. It does not replace your frontier queue, link extractor, deduplication store, or crawl scheduler. Those remain your responsibility.

Use output_format css_extractor on detail pages to push field extraction server-side and receive structured JSON directly, skipping HTML parsing in your scrape workers entirely. This reduces worker memory usage and eliminates a class of parsing bugs caused by HTML schema changes — when the selector stops matching, you get an empty field rather than a stack trace.

Use mode auto as the default. OmniScrape will attempt a fast HTTP fetch first and escalate to a headless browser only when the response indicates JavaScript rendering is required. This keeps costs proportional to actual page complexity — a static product page on a CDN does not consume browser-render credits.

For JavaScript-rendered category pages that your crawler needs to parse for links, use mode js_rendering with js_wait_selector set to a CSS selector that appears only after the product grid has loaded. Apply this only to discovery URLs that genuinely require it — not to every PDP if server-rendered HTML is sufficient for extraction.
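
The mode selection described in this section could be centralized in one small helper. This sketch reuses the `mode` and `js_wait_selector` request fields named above; the stage labels `"discovery"` and `"pdp"` and the function name `build_payload` are illustrative, and any other request fields you need would be merged in by the caller.

```python
from typing import Dict, Optional


def build_payload(url: str, stage: str,
                  wait_selector: Optional[str] = None) -> Dict[str, object]:
    """Browser rendering only for JS-driven discovery pages; auto mode otherwise."""
    if stage not in ("discovery", "pdp"):
        raise ValueError(f"unknown stage: {stage!r}")
    if stage == "discovery" and wait_selector:
        return {
            "url": url,
            "mode": "js_rendering",
            "js_wait_selector": wait_selector,  # should match only once the grid has loaded
        }
    return {"url": url, "mode": "auto"}
```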

OmniScrape's metadata.method_used field tells you whether a given URL was served via fast HTTP or js_rendering. Track this distribution in your metrics. If 95% of your PDPs are served via fast, you have strong evidence that mode auto is correctly calibrated for that domain and you are not over-paying for browser renders.

10. Metrics to instrument at each pipeline stage

Crawler metrics: URLs discovered per hour (track against expected site size), duplicate rate (high rate means your dedup logic is broken or the site has URL proliferation issues), robots.txt disallow hits (compliance signal), discovery 4xx rate by status code (403 means bot protection, 404 means dead links in your seeds), and frontier queue depth over time.

Scraper metrics: field completeness rate per required field (title, price, sku separately — not a single 'success' boolean), cost per successfully extracted row (billing.charged from OmniScrape response), metadata.method_used distribution (fast vs js_rendering ratio), solver activation rate (metadata.solver_used), and end-to-end latency from queue enqueue to warehouse write.

Operational signals: when discovery 403 rate spikes, investigate crawler seeds, politeness settings, and proxy configuration before touching PDP scraper settings — they are different systems. When field completeness drops on a specific field, the target site changed its HTML schema, not your proxy or solver configuration. When metadata.method_used shifts from fast toward js_rendering, the site may have added client-side rendering to previously static pages.

Set alerts on field completeness, not just HTTP success rate. A 200 response that returns an empty price field is a silent failure more damaging than an explicit error — it writes a null record to your warehouse that downstream systems treat as authoritative.
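
Per-field completeness is straightforward to compute from a batch of extracted records. The sketch below treats a missing key, `None`, or a blank string as incomplete; the 98% alert threshold and the function names are assumptions, not values from any particular monitoring setup.

```python
from typing import Dict, Iterable, List, Sequence


def field_completeness(records: Iterable[Dict[str, object]],
                       required: Sequence[str]) -> Dict[str, float]:
    """Fraction of records with a non-blank value, computed per required field."""
    filled = {name: 0 for name in required}
    total = 0
    for record in records:
        total += 1
        for name in required:
            value = record.get(name)
            if value is not None and str(value).strip():
                filled[name] += 1
    return {name: (filled[name] / total if total else 0.0) for name in required}


def fields_below(rates: Dict[str, float], threshold: float = 0.98) -> List[str]:
    """Fields whose completeness fell under the alert threshold."""
    return sorted(name for name, rate in rates.items() if rate < threshold)
```

Running this per domain and per time window makes a drop in one field visible even while the overall HTTP success rate stays flat.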

Frequently asked questions

Is Scrapy a crawler or a scraper?

Both, depending on how you configure it. Scrapy spiders with CrawlSpider rules and link extractors act as crawlers: they follow links and discover URLs. Scrapy spiders with explicit start_urls and item parsers act as scrapers: they extract structured data from known pages. Most production setups use CrawlSpider rules for discovery, parse callbacks for structured extraction, and Item Pipelines for validating and storing the resulting items, often with an external fetch layer like OmniScrape handling protected PDPs. The framework is neutral; the architecture you build on top of it determines which job it does.

Should the crawler or the scraper handle proxies?

Usually the scrape stage, where bot protection is heaviest. Category pages used for discovery are often lightly protected or open, so a simple rotating datacenter proxy pool may suffice for the crawler. PDP pages, especially on e-commerce sites, typically require residential proxies and challenge solving. Route scrape-stage requests through OmniScrape with enable_solver: true and proxy: 'residential' for protected domains. If your category pages are also behind Cloudflare or Akamai, route discovery fetches through OmniScrape as well — there is no rule that says the crawler must fetch directly.

How do I avoid crawling URLs I do not need?

Restrict CrawlSpider link extractors with allow and deny regex patterns that match only your target URL structure. Seed from sitemaps rather than homepages to skip navigation and marketing pages entirely. Cap pagination depth per category (for example with a max_pages spider argument you define yourself, or a counter in your spider state). Set a maximum frontier size and alert when it approaches the limit. Review frontier growth rate daily during the first week of a new crawl: unbounded growth is almost always a configuration bug, not a feature.

Can one OmniScrape API call replace my entire crawler?

No. OmniScrape fetches and extracts one URL per request. It does not maintain a frontier queue, follow links across pages, deduplicate URLs, or schedule recurring crawls. You still need an orchestration layer — whether that is Scrapy, a custom BFS worker, a cron job reading from a sitemap, or a simple URL list — to decide which URLs to fetch. OmniScrape handles the hard parts of fetching a single URL reliably: IP rotation, challenge solving, JavaScript rendering, and structured extraction. The decision of what to fetch remains yours.

What is the most cost-effective architecture for 50,000 product URLs?

Start with the site's XML sitemap or a supplier feed for URL discovery; this is free and takes hours, not days. Feed those URLs into a scrape worker that POSTs each to OmniScrape with mode auto and output_format css_extractor. Run at modest concurrency (start at 5, measure error rates, scale up). Use residential proxies only on domains that block datacenter IPs, and check this empirically by testing with mode fast first. Do not crawl the site's blog, help center, or marketing pages. The total engineering time for this architecture is one to two days; a recursive crawler for the same job would take two weeks and cost more in proxy bandwidth.
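
The "start at modest concurrency and measure" step can be sketched with a fixed-size thread pool. Here `scrape` is any per-URL callable you supply (for instance a function that POSTs one URL to the API and returns the extracted fields), and `scrape_all` is an illustrative name; failures are collected rather than raised so the error rate can be inspected before concurrency is increased.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(urls: Iterable[str], scrape: Callable[[str], dict],
               concurrency: int = 5) -> Tuple[Dict[str, dict], List[Tuple[str, str]]]:
    """Run `scrape` over urls with at most `concurrency` requests in flight."""
    results: Dict[str, dict] = {}
    failures: List[Tuple[str, str]] = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one bad URL costs one record, not the batch
                failures.append((url, repr(exc)))
    return results, failures
```

Dividing `len(failures)` by the number of URLs submitted gives the error rate to watch while scaling up. For very large lists, feeding the pool in batches keeps the number of pending futures bounded.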

How do I handle sites where category pages require JavaScript but PDPs do not?

Use mode js_rendering with js_wait_selector for discovery fetches on JavaScript-rendered category pages, and mode auto (or mode fast if you have confirmed the PDPs are server-rendered) for PDP scrape fetches. OmniScrape's metadata.method_used response field will confirm whether fast HTTP was used. This keeps browser-render costs limited to the discovery stage where they are actually needed. In Scrapy, you can implement this by setting a custom Request meta flag on PDP requests that your downloader middleware uses to select the appropriate OmniScrape mode.

What response field contains the scraped HTML content?

OmniScrape returns HTML content in body.data.content, not body.data.html. When using output_format css_extractor, structured field values are in body.data.css_extracted as a key-value map corresponding to your css_selectors input. Always confirm that body.success is true before reading data fields, and handle empty css_extracted values (the selector did not match) separately from HTTP errors.
