1. When to use Beautiful Soup
Reach for Beautiful Soup when prototyping, when the pipeline already lives in pandas, when broken HTML needs a forgiving parser, or when the team is still learning CSS selectors. Pair it with the lxml parser for speed on large documents.
- Quick selector prototyping in Jupyter
- Legacy codebases already on bs4
- Malformed HTML from third-party templates
- Post-OmniScrape parse when css_extractor is not enough
2. Where the stack breaks (fetch, not Soup)
Soup never sees product prices if the fetch returned challenge HTML or an empty SPA shell. No amount of .select() fixes that; upgrade the fetch to OmniScrape js_rendering or fix js_wait_selector.
- Empty soup.select results on JS-only sites
- No concurrency — bring your own loop or Scrapy
- Slower than selectolax on 5MB documents
- No built-in URL discovery
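Before blaming selectors, it can help to classify what the fetch actually returned. The sketch below is a rough heuristic using only the standard library; the marker strings, the 200-character threshold, and the function name `diagnose_fetch` are assumptions for illustration, not part of OmniScrape or Beautiful Soup.

```python
import re

# Hypothetical marker strings; tune them to the challenge pages you actually see.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")


def diagnose_fetch(html: str, min_text_chars: int = 200) -> str:
    """Classify fetched HTML as 'challenge', 'empty_shell', or 'ok'."""
    lowered = html.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "challenge"
    # Drop script/style bodies, then all remaining tags, to estimate visible text.
    without_code = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    visible = re.sub(r"(?s)<[^>]+>", " ", without_code)
    visible = " ".join(visible.split())
    if len(visible) < min_text_chars:
        return "empty_shell"
    return "ok"
```

Anything other than "ok" points at the fetch layer rather than the selectors. A legitimate page that happens to mention one of the markers would be misclassified, so treat the result as a hint.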
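3. Pattern A: OmniScrape fetch + Beautiful Soup
This is the default production pattern: POST the URL to OmniScrape, read data.content from the response, parse it with BeautifulSoup(html, 'lxml'), select the rows, and validate that required fields are non-empty before writing to the database.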
import os

import requests
from bs4 import BeautifulSoup

API_KEY = os.environ["OMNISCRAPE_KEY"]


def scrape_products(url: str) -> list[dict]:
    # Fetch rendered HTML through OmniScrape.
    r = requests.post(
        "https://api.omniscrape.io/v1/scrape",
        headers={"X-API-Key": API_KEY},
        json={
            "url": url,
            "mode": "auto",
            "output_format": "html",
            "js_wait_selector": ".product-card",
        },
        timeout=120,
    )
    r.raise_for_status()
    body = r.json()
    if not body["success"]:
        raise RuntimeError(body)

    # Parse only after the fetch is confirmed successful.
    soup = BeautifulSoup(body["data"]["content"], "lxml")
    items = []
    for card in soup.select(".product-card"):
        title = card.select_one("h2")
        price = card.select_one(".price")
        # Skip cards that are missing a required element.
        if not title or not price:
            continue
        items.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True),
        })
    return items
4. When to skip Soup entirely
When stable fields map cleanly to css_extractor, OmniScrape returns JSON and the Soup step disappears. Keep Soup for tables, nested traversal, and JSON-LD script tags.
# Same URL, structured path: request body for the same POST as Pattern A
body = {
    "url": url,
    "mode": "auto",
    "output_format": "css_extractor",
    "css_selectors": {"title": "h2", "price": ".price"},
}
# items = r.json()["data"]["css_extracted"]  # dict, not list; adapt for lists
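Tables are one of the cases this section leaves to Soup. As a minimal sketch, the helper below turns a simple HTML table into a list of dicts keyed by header text. The function name `table_to_rows` and the sample markup are invented for illustration, and the code uses the built-in "html.parser" so it runs without lxml installed.

```python
from bs4 import BeautifulSoup


def table_to_rows(html: str, table_selector: str = "table") -> list[dict]:
    """Return one dict per data row, keyed by the header cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one(table_selector)
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.select("tr th")]
    rows = []
    for tr in table.select("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip header rows (no <td>) and rows that do not match the header width.
        if not cells or len(cells) != len(headers):
            continue
        rows.append(dict(zip(headers, cells)))
    return rows


SAMPLE = """
<table>
  <tr><th>SKU</th><th>Price</th></tr>
  <tr><td>A-1</td><td>9.99</td></tr>
  <tr><td>B-2</td><td>14.50</td></tr>
  <tr><td>broken row</td></tr>
</table>
"""
```

Here `table_to_rows(SAMPLE)` keeps the two well-formed rows and drops the short one. Tables with row headers, colspans, or nested tables would need more handling than this sketch provides.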
5. Parsing JSON-LD with Soup
Product schema in script tags often survives CSS redesigns longer than class-based selectors.
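import json

for tag in soup.select('script[type="application/ld+json"]'):
    try:
        data = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue  # skip empty or malformed JSON-LD blocks
    # A JSON-LD script may hold a single object or a list of objects.
    for node in data if isinstance(data, list) else [data]:
        if isinstance(node, dict) and node.get("@type") == "Product":
            print(node.get("offers", {}).get("price"))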
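Real-world JSON-LD is often messier than a single Product object: nodes may sit inside an `@graph` array, `@type` may be a list, and `offers` may be one object or several. The following is a sketch of a more tolerant walker using only the standard library; the helper names `iter_products` and `offer_prices` are hypothetical.

```python
import json
from typing import Iterator, Optional


def iter_products(raw: Optional[str]) -> Iterator[dict]:
    """Yield every JSON-LD node whose @type includes 'Product'."""
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return  # empty or malformed script body
    stack = [data]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            stack.extend(node)
        elif isinstance(node, dict):
            types = node.get("@type")
            if isinstance(types, str):
                types = [types]
            if isinstance(types, list) and "Product" in types:
                yield node
            # Many sites wrap their nodes in an @graph array.
            stack.extend(node.get("@graph", []))


def offer_prices(product: dict) -> list[str]:
    """Return prices from 'offers', which may be one object or a list."""
    offers = product.get("offers", [])
    if isinstance(offers, dict):
        offers = [offers]
    return [str(o["price"]) for o in offers if isinstance(o, dict) and "price" in o]
```

With Soup, each script tag's `tag.string` can be passed to `iter_products`. The walker does not promise any particular order when a page contains several products.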
6. Pattern B: when Soup is not enough
Infinite scroll, hover prices, and login walls need a Playwright browser-as-a-service (BaaS) session. After navigation completes, pass page.content() to Soup if you prefer selectors over locators.
import os

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright


async def baas_then_soup():
    async with async_playwright() as p:
        # Connect to the remote browser over CDP instead of launching one locally.
        browser = await p.chromium.connect_over_cdp(
            f"wss://browser.omniscrape.io?apikey={os.environ['OMNISCRAPE_KEY']}&render_media=false"
        )
        page = await browser.new_page()
        await page.goto("https://protected.example/catalog")
        await page.click("#load-more")
        await page.wait_for_selector(".product-card")
        html = await page.content()
        await browser.close()
    # Parse the rendered HTML once the browser session is closed.
    soup = BeautifulSoup(html, "lxml")
    return [c.get_text(strip=True) for c in soup.select(".product-card .title")]
7. Validate before save
An empty-string price poisons the data warehouse. Assert required fields, and send failures to a dead-letter queue along with a saved HTML snippet.
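One way to sketch this check is shown below. It assumes the row shape from Pattern A ("title" and "price" keys) and uses an in-memory list as a stand-in for a real dead-letter queue; the names `ValidationResult` and `validate_items` are made up for the example.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("title", "price")  # assumed from the Pattern A rows


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)


def validate_items(items: list[dict], html: str, snippet_chars: int = 500) -> ValidationResult:
    """Split rows into those safe to save and those needing review."""
    result = ValidationResult()
    for item in items:
        # A field counts as missing when absent, None, empty, or whitespace only.
        missing = [f for f in REQUIRED_FIELDS if not str(item.get(f) or "").strip()]
        if missing:
            # Keep a slice of the source HTML so the failure can be debugged later.
            result.dead_letter.append(
                {"item": item, "missing": missing, "html_snippet": html[:snippet_chars]}
            )
        else:
            result.valid.append(item)
    return result
```

Only `result.valid` would go to the database; `result.dead_letter` entries would be pushed to whatever queue or table the pipeline uses for review.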
8. Archive HTML for reproducibility
Write data.content to S3 before parsing. When selectors break on a Friday night, you can diff the archived HTML without re-scraping.
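The archive step can be sketched as follows. To stay runnable without cloud credentials, this version writes gzipped HTML to a local directory; the key layout (host, date, URL hash, time) and both function names are assumptions for illustration. In a real pipeline the same key would be passed to the object store's upload call, for example boto3's `put_object`.

```python
import gzip
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def archive_key(url: str, fetched_at: datetime) -> str:
    """Build a key of the form host/YYYY/MM/DD/<url-hash>-HHMMSSZ.html.gz."""
    host = urlparse(url).netloc or "unknown-host"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}/{fetched_at:%Y/%m/%d}/{digest}-{fetched_at:%H%M%S}Z.html.gz"


def archive_html(html: str, url: str, root: Path,
                 fetched_at: Optional[datetime] = None) -> Path:
    """Write gzipped HTML under root before any parsing happens."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    path = root / archive_key(url, fetched_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(html.encode("utf-8")))
    return path
```

Hashing the URL keeps query strings out of the path while still giving each page a stable prefix, which makes it easy to list every snapshot of one URL and diff two of them.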
9. Performance tips
Use the lxml parser. For very large documents, consider selectolax. If parsing is CPU-bound after the fetch, run it in a worker pool.
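A worker-pool parse might look like the sketch below. The function names are hypothetical, the `.product-card` selector is borrowed from Pattern A, and "html.parser" is used only so the example runs without lxml; the executor is passed in so the same code works with a process pool or a thread pool.

```python
from concurrent.futures import Executor

from bs4 import BeautifulSoup


def count_cards(html: str) -> int:
    """Parse step; kept at module top level so a process pool can pickle it."""
    soup = BeautifulSoup(html, "html.parser")  # swap in "lxml" when it is installed
    return len(soup.select(".product-card"))


def parse_all(pages: list[str], executor: Executor) -> list[int]:
    """Parse every page on the given executor, preserving input order."""
    return list(executor.map(count_cards, pages))
```

For CPU-bound parsing, pass a `concurrent.futures.ProcessPoolExecutor` created under an `if __name__ == "__main__":` guard; a `ThreadPoolExecutor` is simpler to set up but is limited by the GIL for pure-Python parsing work.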
10. Checklist
Soup is the parse layer only; invest in fetch quality first.
- Confirm fetch success before Soup
- Prefer data-testid selectors over hashed classes
- Log metadata.method_used for cost
- Try css_extractor before writing 200 lines of parse
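To illustrate the selector advice in the checklist, here is a small sketch that targets data-testid attributes instead of build-generated class names. The markup is invented for the example, and it uses the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Invented markup: the class names mimic build-time hashes that change per deploy.
HTML = """
<div class="css-1x2y3z" data-testid="product-card">
  <h2 class="css-9a8b7c" data-testid="product-title">Mug</h2>
  <span class="css-4d5e6f" data-testid="product-price">9.99</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
card = soup.select_one('[data-testid="product-card"]')
# Attribute selectors keep working when the hashed class names are regenerated.
title = card.select_one('[data-testid="product-title"]').get_text(strip=True)
price = card.select_one('[data-testid="product-price"]').get_text(strip=True)
```

The same attribute selectors work unchanged in the `css_selectors` style of lookup wherever standard CSS is accepted, which is an assumption worth verifying against the target site's markup.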
Frequently asked questions
Beautiful Soup vs css_extractor?
Use css_extractor for flat fields on stable templates, and Soup for tables, JSON-LD, and complex traversal.
lxml vs html.parser?
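lxml is faster, and like html.parser it tolerates broken markup, though the two can build different trees from the same invalid HTML. Install lxml for production.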
Why empty select results?
Suspect the fetch first: print len(html) and check for challenge markers.
Can Soup run JavaScript?
No. OmniScrape js_rendering renders JS before HTML reaches Soup.
Soup with Scrapy?
Use Scrapy selectors in spiders, or run Soup on the HtmlResponse returned by the OmniScrape middleware; see the Scrapy guide.