LLM Training Data Scraping: Building Clean Web Corpora

1. Industry workflow: from allowlist to shard

The pipeline starts with legal, not engineering. Counsel publishes a domain allowlist tiered by license confidence — tier A is explicitly licensed or public-domain, tier B is permissive-but-unreviewed, and everything else is excluded by default. Crawlers only ever read URLs from sitemaps of allowlisted domains, which keeps the discovery surface auditable and prevents an over-eager breadth-first crawl from wandering into a paywalled subdomain or a domain that has since revoked its license.

From there the flow is mechanical: discover article URLs from sitemaps, fetch raw HTML through OmniScrape, extract main text with trafilatura, run language detection, scrub PII, deduplicate with MinHash LSH, score with a safety classifier, and write parquet shards tagged with the source domain's license tier. Each shard lands in a dataset version registry so a future training mix can include or exclude domains without re-crawling. When a publisher updates its template and trafilatura yield drops below 60% of its historical baseline, the affected domain is automatically quarantined for manual review rather than silently poisoning the next mix.
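
The quarantine rule described above can be sketched as a small check. This is a minimal illustration, not the pipeline's actual implementation: the function name, the choice of a median as the "historical baseline", and the sample numbers are assumptions. Only the 60% ratio comes from the text.

```python
from statistics import median

QUARANTINE_RATIO = 0.60  # from the text: below 60% of the historical baseline


def should_quarantine(
    historical_yields: list[float],
    current_yield: float,
    ratio: float = QUARANTINE_RATIO,
) -> bool:
    """Return True when the current yield falls below `ratio` of the baseline.

    The baseline is the median of past yields (an assumption; a rolling mean
    or a percentile would work the same way).
    """
    if not historical_yields:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = median(historical_yields)
    return current_yield < ratio * baseline
```

A median is used here so that one unusually short or long article does not move the baseline much; the domain is only flagged, and the decision to remove it stays with a human reviewer.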

Operationally, the most important property of this design is that the raw HTML archive is the canonical source of truth. Every downstream stage — extraction, language detection, PII filtering, dedup — reads from that archive and writes to the next stage's store. If any stage's logic changes, you replay from the archive rather than re-fetching from publishers. This keeps re-crawl costs near zero for iterative corpus development and makes the pipeline deterministic for a fixed archive snapshot.

2. Document schema

Every document row carries a license_tag and a simhash so that legal audits and dedup passes can both filter the same table without a join. The doc_id is a SHA-256 hash over the normalized extracted text, which means republished articles collapse to the same identity even when their URLs differ across syndication partners. Storing token_estimate at ingest lets you forecast a mix's size in tokens before you ever load it into the trainer, and it gives finance a concrete unit for cost-per-token reporting.

The pii_flag and safety_score fields are set by downstream filter stages and written back to the same row, so a single parquet scan can apply any combination of quality gates without joining multiple tables. The scraped_at timestamp is the fetch time, not the publication date — keep both if the publisher exposes a byline date, since training mixes sometimes want to weight by recency. The simhash field stores a 64-bit fingerprint over shingled token 3-grams, which is compact enough to load the full corpus index into memory for LSH lookups on a single machine even at hundreds of millions of documents.

training document row

json

{
  "doc_id": "sha256:contenthash...",
  "url": "https://publisher.example/article/ai-trends",
  "domain": "publisher.example",
  "license_tag": "allowlist_tier_a",
  "language": "en",
  "title": "AI Trends 2026",
  "text": "Main article body without nav, footer, or cookie banners...",
  "token_estimate": 1842,
  "scraped_at": "2026-06-23T00:00:00Z",
  "pub_date": "2026-06-20T09:15:00Z",
  "pii_flag": false,
  "pii_span_count": 0,
  "safety_score": 0.02,
  "simhash": "abc123def456...",
  "extraction_yield": 0.74,
  "method_used": "fast"
}

3. Fetching article HTML with OmniScrape

For the fetch itself, request html rather than css_extractor — you want the full document body so trafilatura can make its own boilerplate decisions, and you do not want to maintain per-publisher CSS selectors across thousands of templates that change without notice. Most news and reference sites are server-rendered, so mode auto resolves to fast and keeps cost low. Reserve js_rendering for the handful of allowlisted publishers that render the article body client-side, where a js_wait_selector on the main content container is the difference between a full document and an empty skeleton.

Set a timeout of 30–60 seconds depending on your SLA tolerance. Log the metadata.method_used field from every response — it tells you which domains consistently escalate from fast to js_rendering, which is the primary cost signal for corpus budget planning. If a domain's method_used is consistently js_rendering but its content is not especially valuable, that is a data point for removing it from the allowlist in favor of a cheaper equivalent source.

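The response HTML is in body.data.content. Write it verbatim to your raw archive before passing it to the extraction stage, so the archive is always a faithful copy of what OmniScrape returned and not a post-processed artifact. Because doc_id is a hash of the extracted text, it does not exist yet at fetch time: key the raw object by something the fetch worker already has, such as a hash of the URL and fetch timestamp, and record the mapping to doc_id once extraction has run.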

article fetch request

json

POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_KEY
Content-Type: application/json

{
  "url": "https://publisher.example/article/ai-trends",
  "mode": "auto",
  "output_format": "html",
  "timeout": 45
}

4. End-to-end pipeline architecture

The production topology is an allowlist database feeding a polite sitemap crawler, which enqueues article URLs to a worker pool that calls OmniScrape and writes raw HTML to S3 under a legal-defined retention TTL. Each raw object is keyed by doc_id and tagged with the domain's license tier so the archive itself is auditable without reading the content. A separate trafilatura worker pool reads from that raw archive, applies language detection via langdetect or fastText, and writes extracted text records to a staging parquet layer.

From staging, a PII model flags sensitive spans and a safety classifier scores each document. Documents above the safety threshold or with a pii_flag are written to a quarantine partition rather than dropped, so rejection rates remain auditable and threshold changes can be applied retroactively without re-running the full pipeline. MinHash LSH deduplication runs as a batch job over the staged corpus before promotion to the final parquet shards, which are registered against a dataset version manifest that the training job pins by hash.
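
The quarantine routing described above amounts to a two-way partition. A minimal sketch follows, assuming documents are dicts shaped like the schema row in section 2; the threshold value of 0.5 and the partition names are hypothetical.

```python
SAFETY_THRESHOLD = 0.5  # hypothetical; calibrate against your own classifier


def route(doc: dict, threshold: float = SAFETY_THRESHOLD) -> str:
    """Return the partition name for one document."""
    if doc["pii_flag"] or doc["safety_score"] > threshold:
        return "quarantine"
    return "staged"


def partition(docs: list[dict], threshold: float = SAFETY_THRESHOLD) -> dict[str, list[dict]]:
    """Split documents into staged and quarantine lists without dropping any."""
    out: dict[str, list[dict]] = {"staged": [], "quarantine": []}
    for doc in docs:
        out[route(doc, threshold)].append(doc)
    return out
```

Because nothing is deleted, re-running partition with a different threshold over the same rows is enough to apply a policy change retroactively.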

Orchestration typically runs on Airflow with one DAG per stage so a parsing regression can be replayed from the raw archive without re-hitting publishers. The raw S3 layer is the reproducibility anchor — if trafilatura ships a new release that changes extraction behavior, you re-run the extract stage over archived HTML rather than re-crawling, which both saves budget and keeps the corpus comparable across model generations. Each DAG run writes a lineage record that maps the output shard hashes back to the input archive hashes and the software versions used, giving you a full audit trail from training token to original URL.
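
One possible shape for the lineage record mentioned above is sketched below. The field names and the canonical-JSON hashing are assumptions for illustration, not a format the pipeline prescribes.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def lineage_record(stage: str, shard_bytes: bytes, input_hashes: list[str],
                   versions: dict[str, str]) -> dict:
    """Map one output shard back to its inputs and the software that produced it."""
    return {
        "stage": stage,
        "output_shard_sha256": sha256_hex(shard_bytes),
        # Sorted so the record does not depend on processing order.
        "input_archive_sha256": sorted(input_hashes),
        "software_versions": dict(sorted(versions.items())),
    }


def record_id(record: dict) -> str:
    """Stable identifier: SHA-256 over a canonical JSON encoding of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))
```

Sorting the inputs keeps the record identical across replays that read the archive in a different order, which is what lets two runs be compared by hash.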

5. Boilerplate removal and text extraction

Never train on raw HTML. Navigation bars, footers, related-article rails, cookie consent text, and sidebar widgets inflate token counts and teach the model the texture of web chrome instead of prose. trafilatura and readability-lxml both isolate the main article body reliably on most news templates; trafilatura tends to win on metadata extraction, edge-case layouts, and multilingual content, which is why it is the default in the example below.

Drop any document under roughly 150–200 tokens after extraction. Those are almost always stubs, redirects, or pages where extraction failed silently, and they add noise without signal. Track per-domain extraction yield (extracted characters divided by raw HTML bytes) as a time-series metric so a template change that quietly halves yield surfaces as an alert rather than a silent quality regression. A sudden drop in yield for a previously stable domain is almost always a sign that the publisher changed their page structure, not that the content got shorter.

If trafilatura returns None for a document, log it with the URL and raw HTML size so you can distinguish between a failed extraction on a real article versus a 404 page that slipped past your URL filter. A non-trivial rate of None returns on a domain is a signal to inspect that domain's sitemap for non-article URLs.

post-fetch extraction

python

import trafilatura

def html_to_text(html: str, url: str) -> str | None:
    """
    Extract main article text from raw HTML.
    Returns None if extraction fails or the result is too short
    to be a real article (stub, redirect, or extraction failure).
    """
    text = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=False,
        no_fallback=False,
    )
    # Whitespace-separated word count stands in for a real token count here.
    if not text or len(text.split()) < 150:
        return None
    return text


def extraction_yield(raw_html: str, extracted_text: str | None) -> float:
    """
    Ratio of extracted characters to raw HTML bytes.
    Values below ~0.05 on a previously stable domain signal a template change.
    """
    if not extracted_text or not raw_html:
        return 0.0
    return len(extracted_text) / max(len(raw_html.encode()), 1)

6. Deduplication at two levels

Deduplication runs at two levels because the failure modes are different. Exact SHA-256 over normalized text — lowercased, whitespace-collapsed, punctuation-stripped — catches verbatim republishes: the same wire story posted under three domains, or an article mirrored to an affiliate site. This check is cheap enough to run inline at extraction time and eliminates a large fraction of duplicates before the more expensive LSH pass.
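
The exact-match level can be sketched in a few lines. The normalization follows the three steps named above (lowercase, strip punctuation, collapse whitespace); the regular expressions, the function names, and the in-memory seen set are assumptions for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop anything that is not a word char or space
    return re.sub(r"\s+", " ", text).strip()


def content_id(text: str) -> str:
    """SHA-256 over normalized text, in the doc_id format used by the schema."""
    return "sha256:" + hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct normalized text."""
    seen: set[str] = set()
    unique = []
    for text in texts:
        key = content_id(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

At corpus scale the seen set would live in a key-value store rather than in process memory, but the identity function stays the same.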

MinHash LSH catches near-duplicates where a syndicated article has been lightly edited, prefixed with a different lede, or wrapped in a publisher's house style, which exact hashing misses entirely. A Jaccard threshold between 0.7 and 0.8 over 5-gram shingles works well for news content; tune it against a labeled sample of known-duplicate pairs from your allowlist rather than guessing. Too aggressive and you discard legitimately distinct coverage of the same event from different journalists; too loose and syndicated content survives and gets over-weighted in training, which can cause the model to reproduce specific phrasings verbatim.
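
To make the threshold concrete, here is a standard-library sketch of 5-gram shingling, exact Jaccard similarity, and a MinHash estimate of it. Word-level shingles, the BLAKE2b hash family, and 128 hash functions are assumptions; a production job would more likely use a library such as datasketch.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Word-level n-gram shingles (word-level is an assumption)."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _h64(shingle: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    if not items:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_h64(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing estimated_jaccard against jaccard on a labeled sample of known-duplicate pairs is one way to check that the chosen signature length is accurate enough near the 0.7 to 0.8 band.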

Many open corpora derived from Common Crawl are already deduplicated at scale, so if you are supplementing an existing dataset, run your new documents against the existing corpus's MinHash index, not just within the new batch. Keeping the MinHash index in a persistent store like Redis or a dedicated ANN index lets you add new batches incrementally without reprocessing the full corpus each time.
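
The incremental pattern can be illustrated with a small banded LSH index. In this sketch a Python dict stands in for Redis, signatures are plain lists of integers produced elsewhere, and the class name and defaults are hypothetical. With 16 bands of 8 rows, the usual approximation (1/bands)^(1/rows) puts the candidate threshold near 0.71.

```python
from collections import defaultdict


class BandedLSHIndex:
    """In-memory banded LSH index over MinHash signatures."""

    def __init__(self, num_perm: int = 128, bands: int = 16) -> None:
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.bands = bands
        self.rows = num_perm // bands
        # (band index, band values) -> doc ids; a dict stands in for Redis here.
        self.buckets: dict[tuple, set[str]] = defaultdict(set)

    def _keys(self, signature: list[int]):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length does not match the index")
        for band in range(self.bands):
            start = band * self.rows
            yield (band, tuple(signature[start:start + self.rows]))

    def query(self, signature: list[int]) -> set[str]:
        """Return ids of documents sharing at least one full band."""
        candidates: set[str] = set()
        for key in self._keys(signature):
            candidates |= self.buckets.get(key, set())
        return candidates

    def insert(self, doc_id: str, signature: list[int]) -> None:
        for key in self._keys(signature):
            self.buckets[key].add(doc_id)
```

A new batch would call query first and insert only the documents that come back with no candidates, or whose candidates fail an exact Jaccard check, so earlier batches never need to be reprocessed.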

7. Corpus quality metrics

Legal reviews the license audit pass rate before every training run, and a single tier-A domain slipping to an unreviewed state can block the whole mix. Token yield per dollar is the metric that justifies the API spend to finance. A fast server-rendered news domain might cost a fraction of a cent per thousand extracted tokens, while a JS-heavy publisher that consistently escalates to js_rendering costs significantly more. Ranking domains by this number tells you where to invest extraction engineering effort and where to seek licensed data feeds instead.

The method escalation rate deserves particular attention because it is both a cost signal and a quality signal. A domain that suddenly starts escalating to js_rendering after months of fast fetches has probably added a JavaScript paywall or anti-bot layer; the content may still be extractable, but the economics have changed and the domain should be reviewed. Conversely, a domain with consistently low escalation rate and high extraction yield is a high-value, low-cost source worth prioritizing in the allowlist.

Corpus uniqueness ratio: post-dedup document count divided by pre-dedup count, tracked per batch and per domain
PII detection rate: measured on a periodic stratified audit sample, broken down by entity type (email, phone, name, etc.)
Token yield per dollar: extracted tokens divided by sum of billing.charged across all fetches for the batch
License audit pass rate by tier: fraction of documents whose domain's license status is current and approved
Safety filter rejection rate: documents quarantined divided by documents processed, with breakdown by score bucket
Language distribution vs target mix: fraction of tokens per language against the mix specification
Extraction yield by domain: extracted characters per raw HTML byte, tracked as a time series to catch template changes
Method escalation rate: fraction of fetches where metadata.method_used was js_rendering rather than fast
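
Three of the metrics above reduce to simple ratios over per-fetch records. The sketch below assumes each record is a dict with method_used, charged (copied from billing.charged and taken to be in dollars), and tokens (extracted tokens for that fetch); the record layout and function name are hypothetical.

```python
def corpus_metrics(fetches: list[dict], docs_before_dedup: int,
                   docs_after_dedup: int) -> dict[str, float]:
    """Compute uniqueness ratio, token yield per dollar, and escalation rate."""
    total_charged = sum(f["charged"] for f in fetches)
    total_tokens = sum(f["tokens"] for f in fetches)
    escalated = sum(1 for f in fetches if f["method_used"] == "js_rendering")
    return {
        "uniqueness_ratio": (
            docs_after_dedup / docs_before_dedup if docs_before_dedup else 0.0
        ),
        "token_yield_per_dollar": (
            total_tokens / total_charged if total_charged else 0.0
        ),
        "method_escalation_rate": escalated / len(fetches) if fetches else 0.0,
    }
```

Grouping the fetch records by domain before calling the function gives the per-domain ranking discussed in this section.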

8. Copyright, licensing, and provenance

Public accessibility of HTML does not equal permission to use content for model training, and treating reachability as license is how teams end up in litigation or forced to retrain. The allowlist exists precisely so that every token in the corpus traces back to a domain your counsel has cleared, whether that clearance comes from a public-domain status, a Creative Commons license, a robots.txt that explicitly permits training use, or a negotiated training-data agreement with the publisher.

Some publishers now license training corpora explicitly through data licensing programs, and those licensed feeds are usually cleaner, more consistently structured, and cheaper per token than building a scraper for the same content. Check for a licensed option before engineering a scraper for any large source — the engineering cost of a scraper plus ongoing maintenance often exceeds the licensing fee, and the licensed data comes with a much cleaner provenance trail.

Keep the license_tag immutable on each document row from the moment it is written. If a domain's license is later revoked or reclassified, you need to be able to surgically exclude the affected shards from future training mixes without rebuilding the corpus. A dataset version registry that maps shard hashes to domain lists makes this exclusion a query rather than a re-crawl. Document every license review decision with a timestamp and the reviewer's name so that the audit trail is human-readable, not just a database flag.
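
The "exclusion as a query" idea can be shown with a registry held as a plain mapping from shard hash to the set of domains in that shard. The mapping shape and function name are assumptions; a real registry would be a database table.

```python
def shards_to_exclude(registry: dict[str, set[str]], revoked_domains: set[str]) -> set[str]:
    """Return hashes of shards containing at least one revoked domain."""
    return {
        shard_hash
        for shard_hash, domains in registry.items()
        if domains & revoked_domains
    }


def allowed_shards(registry: dict[str, set[str]], revoked_domains: set[str]) -> set[str]:
    """Shards that remain usable for the next training mix."""
    return set(registry) - shards_to_exclude(registry, revoked_domains)
```

If shards are written one domain per shard, a revocation removes only that domain's tokens; mixed shards are excluded whole, which is a reason to keep shards domain-pure where practical.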

9. PII detection and toxic content filtering

PII filtering combines regex for structured patterns — email addresses, phone numbers, SSN-shaped strings, credit card formats, IP addresses — with a named-entity recognition model for contextual personal data that regex cannot catch, such as a person's name paired with their employer and salary in a news article. Run the filter on extracted text, not raw HTML, so you are not flagging template artifacts like support email addresses in page footers, and store a pii_flag plus the span offsets so a reviewer can audit decisions without re-running the model on the full document.
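
The regex half of the filter might look like the sketch below. The patterns are deliberately simple illustrations (US-style phone and SSN shapes, no credit card check) and will produce both false positives and misses; they are assumptions, not a vetted rule set, and the NER model is out of scope here.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware rules.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}


def find_pii_spans(text: str) -> list[dict]:
    """Return matches as {type, start, end} dicts, sorted by position."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append({"type": label, "start": match.start(), "end": match.end()})
    return sorted(spans, key=lambda s: (s["start"], s["end"]))


def pii_fields(text: str) -> dict:
    """Row fields matching the schema: pii_flag, pii_span_count, plus offsets."""
    spans = find_pii_spans(text)
    return {"pii_flag": bool(spans), "pii_span_count": len(spans), "pii_spans": spans}
```

Storing the offsets rather than the matched strings lets a reviewer inspect a decision against the archived text without copying the sensitive values into a second table.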

A safety classifier scores each document for slurs, graphic violence, self-harm content, and other material you want kept out of pretraining. Documents above the threshold are written to a quarantine partition rather than deleted outright, so the rejection rate stays auditable and you can adjust the threshold and re-classify without losing the original documents. Calibrate the threshold conservatively for pretraining corpora — it is far cheaper to drop a borderline document than to discover the model learned something you cannot easily unlearn, and the quarantine partition gives you a labeled dataset for threshold calibration if you collect human judgments on a sample.

For multilingual corpora, run both PII and safety models in the document's detected language rather than translating to English first. Translation introduces errors that can cause both false positives (flagging content that is benign in context) and false negatives (missing content that a native speaker would immediately recognize as problematic). If you do not have a safety model for a given language, treat that language as requiring manual review before inclusion rather than defaulting to pass.

10. Cost optimization for corpus-scale fetching

Most allowlisted publishers are server-rendered, so leaving mode on auto keeps the vast majority of fetches on the fast path and the bill predictable. Batch crawls during off-peak windows where your scheduler allows, and log metadata.method_used on every response so you can identify the handful of domains that consistently escalate to js_rendering and decide whether their content justifies the premium. A domain that costs ten times as much per fetch and delivers the same token yield as a cheaper alternative is a candidate for removal from the allowlist.

For scale, fan out with an async worker pool rather than a single synchronous loop — the patterns in the httpx scraping guide handle connection pooling and backpressure cleanly. Cap in-flight requests per domain to stay polite and avoid triggering rate limits that would force retries and inflate cost. Use exponential backoff with jitter on 429 and 503 responses rather than a fixed retry interval, which tends to cause thundering-herd retries that make the rate limiting worse.
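
The per-domain cap and the backoff rule can be combined in a short asyncio sketch. The do_fetch callable is a hypothetical stand-in for whatever client performs the request and returns a (status, body) pair; the limit of two in-flight requests per domain, the one-second base, and the 60-second cap are assumptions. The jitter variant shown is "full jitter", where the delay is drawn uniformly between zero and the exponential bound.

```python
import asyncio
import random
from collections import defaultdict
from urllib.parse import urlsplit

RETRYABLE = {429, 503}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class DomainLimiter:
    """One semaphore per domain, created lazily inside the running event loop."""

    def __init__(self, per_domain: int = 2) -> None:
        self._semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain))

    def for_url(self, url: str) -> asyncio.Semaphore:
        return self._semaphores[urlsplit(url).netloc]


async def fetch_with_retry(url, do_fetch, limiter, max_attempts: int = 5,
                           sleep=asyncio.sleep):
    """Fetch under the domain cap, retrying 429/503 with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        async with limiter.for_url(url):
            status, body = await do_fetch(url)
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        # Sleep outside the semaphore so the slot is free for other URLs.
        await sleep(backoff_delay(attempt))
```

Running many fetch_with_retry coroutines through asyncio.gather with a shared DomainLimiter gives broad fan-out across domains while each individual domain sees at most per_domain concurrent requests.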

Re-extracting from the S3 archive instead of re-crawling is the single largest cost lever once a corpus is established, since extraction only costs compute you control while every fetch is billed. Structure your pipeline so that the raw archive is always the input to the extraction stage, and resist the temptation to skip archiving to save storage costs: at typical object storage prices, the storage cost of the raw archive is a small fraction of the re-crawl cost for any corpus of meaningful size. The archive also lets you A/B test extraction configurations without touching publishers, which is valuable when evaluating a new version of trafilatura or a different extraction library.

Frequently asked questions

Can I scrape the entire web for LLM training data?

Technically you can crawl broadly, but doing so creates both legal and quality liabilities that are difficult to unwind after the fact. Serious teams work from explicit domain allowlists and licensed sources so every token has a defensible provenance, and they supplement with open corpora like Common Crawl, The Pile, or RedPajama rather than reinventing a noisy general crawl. A targeted allowlist corpus with high extraction yield is usually more valuable per token than a broad crawl with poor quality control.

Why use OmniScrape instead of a raw HTTP client like curl or httpx?

Even allowlisted publishers sit behind Cloudflare, Akamai, or aggressive rate-limit policies that a naive HTTP loop trips within minutes. OmniScrape handles TLS fingerprinting, browser challenge solving, and IP rotation transparently, so you spend engineering time on extraction and dedup rather than maintaining a curl_cffi or Playwright fingerprinting layer. The metadata.method_used field also gives you a per-domain signal on whether fast or js_rendering was needed, which directly informs cost forecasting.

Should I store the raw HTML, or just the extracted text?

Store the raw HTML, at least for the duration your legal team specifies as the retention TTL. The raw archive is your reproducibility anchor: when trafilatura ships a new release, when you want to try a different extraction library, or when a quality audit reveals that a filter was too aggressive, you replay from the archive rather than re-crawling publishers. The storage cost at typical object storage rates is a small fraction of the re-crawl cost for any corpus of meaningful size.

What about Wikipedia, Common Crawl, and other open datasets?

For bulk general text, prefer existing open corpora — they are pre-cleaned, well documented, come with established license terms, and avoid redundant crawling pressure on origin sites. Use OmniScrape for the supplemental niche allowlist domains that those corpora cover poorly or not at all, such as specialized industry publications you have licensed, recent content that post-dates the last Common Crawl snapshot, or domains whose content quality is high enough to justify the fetch cost.

How do I handle JavaScript-rendered article content?

Set mode to js_rendering and use js_wait_selector pointing at the main article container — for example, 'article.post-content' or 'div[data-testid="article-body"]'. This tells OmniScrape to wait until that element is present in the DOM before returning the HTML, which ensures trafilatura receives a fully hydrated document rather than a skeleton. Log which domains require js_rendering so you can track the cost premium and decide whether each one belongs on the allowlist. See scraping JavaScript-rendered pages for a detailed walkthrough.

How do I run deduplication at corpus scale without loading everything into memory?

Use MinHash LSH with a persistent index stored in Redis or a dedicated approximate nearest neighbor store. Compute the MinHash signature for each new document at extraction time and query the index before writing to the staged corpus. This lets you add new batches incrementally without reprocessing the full corpus. For the initial corpus build, datasketch's MinHashLSH with a Redis storage backend keeps the index out of process memory, and a distributed implementation on Spark handles hundreds of millions of documents without requiring the full signature matrix in memory on a single machine.

How do I run the fetch pipeline at scale without overwhelming publishers or the API?

Use an async worker pool with per-domain concurrency limits — typically one to three concurrent requests per domain — and a shared queue that fans out across many domains in parallel. This keeps per-domain request rates polite while maximizing overall throughput. Implement exponential backoff with jitter on 429 and 503 responses. The patterns in the httpx web scraping guide cover connection pooling and backpressure handling in detail. For very large allowlists, partition the domain list across multiple worker processes and use a distributed queue like SQS or Pub/Sub to coordinate work.

