Web Scraping with Python

1. Install what you need

You need two libraries to start: one to fetch pages, one to parse HTML. requests is the classic choice; httpx is the modern alternative with async support if you plan to run hundreds of concurrent fetches later.

terminal

bash

pip install requests beautifulsoup4 lxml

# Optional — async fetching at scale:
pip install httpx

2. Fetch a page with requests

Here is the simplest possible scraper. It works on sites with no bot protection — government open-data portals, small business sites, your own staging environment. Save this as scrape.py and run it.

Open the saved page.html in a browser. If you see the content you expected, your fetch layer is fine and you can move on to parsing. If you see a challenge page or a 403, skip ahead to the OmniScrape section — no amount of header tweaking will fix a Cloudflare-protected retailer long-term.

scrape.py

python

import requests

url = "https://books.toscrape.com/catalogue/page-1.html"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("page.html", "w", encoding="utf-8") as f:
    f.write(response.text)

# len(response.text) counts characters, not bytes.
print(f"Saved {len(response.text):,} characters (status {response.status_code})")
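
A quick programmatic check can complement opening page.html by hand. The sketch below is a heuristic only: the name looks_blocked, the status set, and the marker strings are assumptions chosen for illustration (phrases that commonly show up on challenge pages), not a complete or vendor-documented list.

```python
# Heuristic sketch: flag responses that look like a block or challenge page.
# BLOCK_STATUSES and CHALLENGE_MARKERS are illustrative assumptions; tune them
# for the sites you actually scrape.
BLOCK_STATUSES = {403, 429, 503}
CHALLENGE_MARKERS = (
    "just a moment",         # title often seen on challenge interstitials
    "attention required",
    "verify you are human",
    "captcha",
)


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when the status code or body suggests bot protection."""
    if status_code in BLOCK_STATUSES:
        return True
    head = html[:5000].lower()  # markers normally appear near the top
    return any(marker in head for marker in CHALLENGE_MARKERS)


print(looks_blocked(200, "<html><title>All products | Books</title></html>"))
print(looks_blocked(200, "<html><title>Just a moment...</title></html>"))
```

In scrape.py you could call looks_blocked(response.status_code, response.text) before raise_for_status() and stop early when it returns True. Expect occasional false positives, for example on normal pages that mention a captcha.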

3. Extract data with Beautiful Soup

Raw HTML is not useful until you turn it into rows. Beautiful Soup lets you query the DOM with CSS selectors — the same selectors you would use in browser DevTools.

books.toscrape.com is a deliberately scraper-friendly demo site. Real targets change their HTML without warning, so always validate that extracted fields are non-empty before saving to your database. An empty price string is worse than a crash — it silently poisons your dataset.

parse.py

python

from bs4 import BeautifulSoup

# Parse the page.html that scrape.py saved, so this file runs on its own.
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "lxml")

books = []
for card in soup.select("article.product_pod"):
    title = card.select_one("h3 a")["title"]
    price = card.select_one(".price_color").get_text(strip=True)
    availability = card.select_one(".instock").get_text(strip=True)
    books.append({"title": title, "price": price, "in_stock": "In stock" in availability})

print(f"Found {len(books)} books on this page")
for b in books[:3]:
    print(b)
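
The non-empty check recommended above can be sketched as a small filter that runs before anything is saved. REQUIRED_FIELDS, missing_fields, and split_valid are hypothetical names, and the sample rows are made up; the record shape mirrors the dictionaries built in parse.py.

```python
# Sketch of a pre-save validation step. All names here are hypothetical and
# the sample rows are invented for illustration.
REQUIRED_FIELDS = ("title", "price")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent, None, or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def split_valid(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate records that are safe to save from ones that need review."""
    valid, rejected = [], []
    for record in records:
        problems = missing_fields(record)
        if problems:
            rejected.append({"record": record, "missing": problems})
        else:
            valid.append(record)
    return valid, rejected


sample = [
    {"title": "Example Book", "price": "£10.00", "in_stock": True},
    {"title": "Broken row", "price": "  ", "in_stock": True},
]
valid, rejected = split_valid(sample)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Keeping the rejected rows, rather than dropping them, gives you something to inspect when a selector stops matching after a layout change.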

4. Loop through every page

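Most catalogs span multiple pages. The pattern is always the same: fetch, parse, move to the next page, repeat. The script below finds the next page by incrementing the page number until the site returns a 404 or a page with no product cards. Add a short sleep between requests when scraping directly; two seconds is a reasonable default for polite crawling.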

When you route through an API (see below), the API handles politeness and IP rotation on its side, but you should still avoid firing ten thousand URLs in a tight loop from a single worker process.

paginate.py

python

import time

import requests
from bs4 import BeautifulSoup

all_books = []
page = 1

while True:
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    r = requests.get(url, timeout=30)
    if r.status_code == 404:
        break  # ran past the last page
    r.raise_for_status()  # any other HTTP error should stop the crawl loudly

    soup = BeautifulSoup(r.text, "lxml")
    cards = soup.select("article.product_pod")
    if not cards:
        break

    for card in cards:
        all_books.append({
            "title": card.select_one("h3 a")["title"],
            "price": card.select_one(".price_color").get_text(strip=True),
        })

    print(f"Page {page}: {len(cards)} books (total so far: {len(all_books)})")
    page += 1
    time.sleep(2)

print(f"Done — {len(all_books)} books total")
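
Where page numbers are not predictable, following the next-page link is an alternative to counting. The sketch below keeps the loop independent of any HTTP or parsing library: crawl, fetch, and parse are hypothetical names, and the two callables are supplied by you. The demo uses a fake two-page site so the control flow can be checked without network access.

```python
import time
from typing import Callable, Iterator, Optional
from urllib.parse import urljoin


def crawl(
    start_url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], tuple[list[dict], Optional[str]]],
    delay: float = 2.0,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items page by page, following next links until none is left."""
    url: Optional[str] = start_url
    seen: set[str] = set()  # guards against next-link loops
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, next_href = parse(fetch(url))
        yield from items
        # Resolve relative hrefs such as "page-2.html" against the current URL.
        url = urljoin(url, next_href) if next_href else None
        if url:
            time.sleep(delay)


# Fake site: the "HTML" is just the URL, and parse looks the page up in a dict.
FAKE_SITE = {
    "https://example.test/catalogue/page-1.html": ([{"title": "A"}], "page-2.html"),
    "https://example.test/catalogue/page-2.html": ([{"title": "B"}], None),
}
items = list(crawl(
    "https://example.test/catalogue/page-1.html",
    fetch=lambda url: url,
    parse=lambda page: FAKE_SITE[page],
    delay=0,
))
print(items)
```

With Beautiful Soup, parse would return the extracted rows plus the href of the next link. On books.toscrape.com that link appears to match the selector li.next a, but confirm it in DevTools before relying on it.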

5. When requests stops working

The moment you point requests at a Cloudflare-protected shop, a travel site behind Akamai, or a marketplace with DataDome, you get back challenge HTML instead of product data. You might patch it with cloudscraper or a headless Chrome instance — and it might work for a week. Then the protection vendor updates and you are debugging TLS fingerprints at midnight.

That is the point where teams switch to a scraping API. You keep your Python parsing logic; you replace the fetch line with a POST to OmniScrape. The API returns the real page HTML after solving whatever challenge was in the way. Read our Cloudflare bypass guide if you want to understand what happens on the other side of that request.

6. Fetch protected pages through OmniScrape

Same Python script, different fetch. Send the URL to POST https://api.omniscrape.io/v1/scrape with your API key in the X-API-Key header. Set mode to auto — the API tries a fast HTTP path first and only opens a real browser if the page needs it.

The response puts the finished HTML in data.content. Feed that into Beautiful Soup exactly like before. Your parsing code does not change.

protected_scrape.py

python

import os

import requests
from bs4 import BeautifulSoup

API_KEY = os.environ["OMNISCRAPE_KEY"]
TARGET = "https://protected-shop.com/product/12345"

resp = requests.post(
    "https://api.omniscrape.io/v1/scrape",
    headers={"X-API-Key": API_KEY},
    json={
        "url": TARGET,
        "mode": "auto",
        "output_format": "html",
    },
    timeout=120,
)
resp.raise_for_status()
body = resp.json()

if not body["success"]:
    raise RuntimeError(f"Scrape failed: {body}")

html = body["data"]["content"]
method = body["metadata"]["method_used"]
cost = body["billing"]["charged"]

print(f"Got {len(html):,} bytes via {method} — cost ${cost:.4f}")

soup = BeautifulSoup(html, "lxml")
price = soup.select_one(".product-price")
print("Price:", price.get_text(strip=True) if price else "NOT FOUND — check selector")
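
If several scripts share this response handling, it may help to pull it into one helper. The sketch below assumes only the field names already used in protected_scrape.py (success, data.content, metadata.method_used, billing.charged); ScrapeError and unwrap are hypothetical names, and the sample body is made up.

```python
# Sketch: unwrap an OmniScrape-style response body defensively.
# ScrapeError and unwrap are hypothetical names, not part of any SDK.
class ScrapeError(RuntimeError):
    """Raised when the response body does not contain usable HTML."""


def unwrap(body: dict) -> dict:
    """Return html, method, and charge, or raise ScrapeError."""
    if not body.get("success"):
        raise ScrapeError(f"Scrape failed: {body}")
    content = (body.get("data") or {}).get("content")
    if not content:
        # A successful flag with no HTML would otherwise fail later, in parsing.
        raise ScrapeError("Response marked successful but data.content is empty")
    return {
        "html": content,
        "method": (body.get("metadata") or {}).get("method_used"),
        "charged": (body.get("billing") or {}).get("charged"),
    }


# Made-up body for illustration; real responses may carry more fields.
sample_body = {"success": True, "data": {"content": "<html><body>ok</body></html>"}}
result = unwrap(sample_body)
print(len(result["html"]), result["method"], result["charged"])
```

Using .get for metadata and billing means a missing optional block yields None rather than a KeyError, while a missing page body still fails loudly.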

7. Skip parsing: get JSON directly

If you know the CSS selectors upfront, you can ask OmniScrape to extract fields server-side. Add output_format: css_extractor and a css_selectors map. The response comes back as structured JSON in data.css_extracted — no Beautiful Soup step needed.

This is the fastest path for production pipelines: your Python worker receives JSON, validates it, and writes to Postgres or S3. Less code, fewer places for layout changes to break you silently.

structured_scrape.py

python

resp = requests.post(
    "https://api.omniscrape.io/v1/scrape",
    headers={"X-API-Key": API_KEY},
    json={
        "url": TARGET,
        "mode": "auto",
        "output_format": "css_extractor",
        "css_selectors": {
            "title": "h1.product-name",
            "price": ".price-current",
            "rating": ".star-rating",
            "reviews": ".review-count",
        },
    },
    timeout=120,
)
resp.raise_for_status()  # surface 401/402/429/502 instead of a confusing KeyError
data = resp.json()["data"]["css_extracted"]
print(data)
# {"title": "Wireless Earbuds", "price": "$79.99", "rating": "4.6", "reviews": "1,284"}
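
The "validates it" step mentioned above could look like the sketch below. It assumes every extracted value arrives as a string, as in the sample output, and that prices use a dot as the decimal separator; to_decimal and normalize are hypothetical names.

```python
import re
from decimal import Decimal, InvalidOperation


def to_decimal(text: str) -> Decimal:
    """Strip currency symbols and thousands separators, e.g. '$1,079.99'."""
    cleaned = re.sub(r"[^0-9.]", "", text)
    if not cleaned:
        raise ValueError(f"no digits in {text!r}")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"cannot parse {text!r}") from exc


def normalize(raw: dict) -> dict:
    """Convert extracted strings to typed values, raising on empty fields."""
    title = (raw.get("title") or "").strip()
    if not title:
        raise ValueError("title is empty")
    return {
        "title": title,
        "price": to_decimal(raw.get("price") or ""),
        "rating": float(to_decimal(raw.get("rating") or "")),
        "reviews": int(to_decimal(raw.get("reviews") or "")),
    }


# Same values as the sample output shown above.
row = normalize({"title": "Wireless Earbuds", "price": "$79.99",
                 "rating": "4.6", "reviews": "1,284"})
print(row)
```

Raising on an empty field, rather than storing a blank, keeps a changed selector from quietly filling your table with empty prices.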

8. Pages that need JavaScript

Some sites ship an empty HTML shell and load prices or listings with React after the page opens. requests and the fast HTTP lane both return that empty shell. You need a real browser to execute JavaScript first.

Set mode to js_rendering and tell the API which element to wait for with js_wait_selector. The browser waits until that element appears in the DOM, then returns the fully rendered HTML. For more detail on when and why this happens, see scraping JavaScript-rendered pages.

js_render.py

python

resp = requests.post(
    "https://api.omniscrape.io/v1/scrape",
    headers={"X-API-Key": API_KEY},
    json={
        "url": "https://spa-store.com/products",
        "mode": "js_rendering",
        "output_format": "html",
        "js_wait_selector": ".product-card",
        "js_wait_timeout": 10000,
    },
    timeout=120,
)
resp.raise_for_status()
html = resp.json()["data"]["content"]
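
When you are not sure whether a page needs rendering, one option is to try mode auto first and escalate to js_rendering only if the expected content is missing. This is a sketch of that idea, not a documented API feature: post stands in for the requests.post call to /v1/scrape and must return the page HTML, and has_content is your own check. Both hooks and the function name are hypothetical, and each attempt is presumably billed separately.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    wait_selector: str,
    post: Callable[[dict], str],
    has_content: Callable[[str], bool],
) -> tuple[str, str]:
    """Return (html, mode_used), escalating to js_rendering only if needed."""
    base = {"url": url, "output_format": "html"}
    html = post({**base, "mode": "auto"})
    if has_content(html):
        return html, "auto"
    html = post({
        **base,
        "mode": "js_rendering",
        "js_wait_selector": wait_selector,
        "js_wait_timeout": 10000,  # same value as js_render.py
    })
    return html, "js_rendering"


# Demo with a fake transport so the flow can be checked without network access.
def fake_post(payload: dict) -> str:
    if payload["mode"] == "js_rendering":
        return '<div class="product-card">Widget</div>'
    return '<div id="root"></div>'  # empty shell


html, mode = fetch_with_fallback(
    "https://spa-store.com/products",
    ".product-card",
    post=fake_post,
    has_content=lambda h: "product-card" in h,
)
print(mode)  # js_rendering
```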

9. Scale with httpx and asyncio

When you have thousands of URLs and each fetch takes one to five seconds through an API, sequential requests are too slow. httpx supports async out of the box — pair it with asyncio.Semaphore to cap concurrency (five to ten in-flight requests is a sensible starting point).

The pattern below covers the essentials: bounded concurrency, per-URL error capture, and structured results you can pipe into a database writer. It does not retry failed URLs.

async_scrape.py

python

import asyncio
import httpx
import os

API_KEY = os.environ["OMNISCRAPE_KEY"]
URLS = ["https://example.com/p/1", "https://example.com/p/2"]  # your list

async def scrape_one(client: httpx.AsyncClient, url: str) -> dict:
    r = await client.post(
        "https://api.omniscrape.io/v1/scrape",
        headers={"X-API-Key": API_KEY},
        json={"url": url, "mode": "auto", "output_format": "css_extractor",
              "css_selectors": {"title": "h1", "price": ".price"}},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["data"].get("css_extracted", {})

async def main():
    sem = asyncio.Semaphore(5)
    async with httpx.AsyncClient() as client:
        async def bounded(url):
            async with sem:
                return await scrape_one(client, url)
        results = await asyncio.gather(*[bounded(u) for u in URLS], return_exceptions=True)
    for url, result in zip(URLS, results):
        print(url, result if not isinstance(result, Exception) else f"ERROR: {result}")

asyncio.run(main())
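
To see that the semaphore really caps concurrency, you can run the same pattern against a fake coroutine instead of the API. This is a standalone sketch: run_bounded and fake_fetch are hypothetical names, and the simulated failure shows how return_exceptions=True keeps one bad URL from cancelling the rest.

```python
import asyncio

in_flight = 0  # tasks currently inside fake_fetch
peak = 0       # highest in_flight value observed


async def fake_fetch(n: int) -> int:
    """Stand-in for an API call; records concurrency and fails for n == 3."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    try:
        await asyncio.sleep(0.01)
        if n == 3:
            raise ValueError("simulated failure")
        return n * 2
    finally:
        in_flight -= 1


async def run_bounded(items, worker, limit: int = 5) -> list:
    """Run worker over items with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # Results come back in input order; exceptions are returned, not raised.
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


results = asyncio.run(run_bounded(range(20), fake_fetch, limit=5))
failures = [r for r in results if isinstance(r, Exception)]
print(f"peak concurrency: {peak}, failures: {len(failures)}")
```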

10. Handle API errors properly

The OmniScrape API returns specific HTTP status codes. Treat them differently in your retry logic:

200 + success:true — got data; parse and save
401 — bad API key; fix your env var, do not retry
402 — out of balance; top up account, alert your team
429 — sending too fast; sleep with exponential backoff, then retry
502 — worker temporarily busy; retry up to 3 times with jitter
200 + success:false — page-level failure (404 on target, empty render); log URL, send to dead-letter queue
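
The list above maps onto a small decision function. In this sketch the names next_action, backoff_delay, and MAX_RETRIES are hypothetical; the limit of three retries for 502 comes from the list, while the limit of five for 429 and the backoff parameters are assumptions.

```python
import random

# Retries allowed per status. 502: 3 is from the list above; 429: 5 is an assumption.
MAX_RETRIES = {429: 5, 502: 3}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def next_action(status: int, success: bool, attempt: int) -> str:
    """Return 'save', 'abort', 'retry', or 'dead_letter'.

    `attempt` is the number of retries already made for this URL.
    """
    if status == 200:
        return "save" if success else "dead_letter"
    if status in (401, 402):
        return "abort"  # bad key or empty balance: retrying cannot help
    if status in MAX_RETRIES:
        return "retry" if attempt < MAX_RETRIES[status] else "dead_letter"
    return "dead_letter"  # unknown status: park the URL for inspection


print(next_action(502, False, attempt=0))
print(next_action(200, False, attempt=0))
```

A worker loop would call next_action after each response and sleep for backoff_delay(attempt) before any retry.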

Frequently asked questions

Should I use requests or httpx for web scraping?

requests for scripts and notebooks where simplicity matters. httpx when you need async concurrency or HTTP/2. Both work identically with the OmniScrape API — you POST JSON and read JSON back.

Do I still need Beautiful Soup if I use css_extractor?

Not for the fields you define in css_selectors — OmniScrape returns them as JSON. Keep Beautiful Soup for cases where you need to traverse complex DOM structures, extract tables, or archive raw HTML for re-parsing later.

How do I scrape a site that requires login?

For pages behind authentication you control, use Browser-as-a-Service and script the login flow with Playwright. For public pages behind bot protection, Web Unlocker with mode:auto is enough. Do not scrape private user data you are not authorized to access.

What is the cheapest way to scrape with Python and OmniScrape?

Use mode:auto so simple pages hit the fast lane (~$0.0035/request). Use css_extractor to skip parsing code. Reserve js_rendering for pages that genuinely need JavaScript. Check metadata.method_used in each response to see what you were charged for.

Can I use Scrapy instead of a raw loop?

Yes. Write a custom Scrapy download middleware that POSTs each request URL to OmniScrape and returns the HTML. Scrapy handles concurrency, retries, and exports; OmniScrape handles bot protection. See our Scrapy scraping guide for the middleware pattern.
