Lead Generation Web Scraping: Compliant Inbound Enrichment for Sales Teams

1. Industry workflow: inbound enrichment

Enrichment is triggered the moment an inbound lead hits a webhook. The first step is normalizing a company domain out of the work email — a personal Gmail address should immediately disqualify the record from automated enrichment and route to manual review, since chasing the wrong target wastes SDR time and distorts pipeline data. Once a valid corporate domain is confirmed, the system enqueues a small, fixed set of that company's own URLs: homepage, /about, and /contact. OmniScrape fetches each in parallel, extracting the Organization JSON-LD block plus a handful of visible CSS fields. Those results merge with any configured fallback providers, pass through a mandatory email-verification stage, and are written into CRM custom fields where SDR sequences can read them.
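
The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the free-mail list is a deliberately incomplete assumption, and the function names are hypothetical.

```python
from typing import Optional

# Illustrative, incomplete list of personal mailbox providers (assumption).
FREE_MAIL_DOMAINS = {
    "gmail.com", "googlemail.com", "yahoo.com",
    "outlook.com", "hotmail.com", "icloud.com",
}

# The fixed, company-owned pages named in the text.
ENRICHMENT_PATHS = ("/", "/about", "/contact")


def company_domain_from_email(email: str) -> Optional[str]:
    """Return the normalized corporate domain, or None to signal manual review."""
    local, sep, domain = email.strip().lower().rpartition("@")
    domain = domain.rstrip(".")
    if not sep or not local or "." not in domain:
        return None  # malformed address
    if domain in FREE_MAIL_DOMAINS:
        return None  # personal mailbox: do not auto-enrich
    return domain


def enrichment_urls(domain: str) -> list[str]:
    """Build the bounded URL list for one company; never an open-ended crawl."""
    return [f"https://{domain}{path}" for path in ENRICHMENT_PATHS]
```

A None result is the cue to route the lead to manual review instead of enqueueing any URLs.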

EU leads route through a stricter path: a tighter source allowlist and a shorter data-retention TTL, on top of the pre-flight check against the opt-out suppression table that every lead passes before any request is made. The whole journey is designed to complete in well under sixty seconds so a rep working a fresh inbound lead sees enriched data before they compose their first message. Speed matters commercially, but it never justifies widening the source list beyond pages the company publishes about itself. The moment the pipeline starts reaching into third-party directories as a primary source, freshness and defensibility both degrade.

2. Example data schema

Every enriched record carries a match_confidence score so downstream automation only fires above a threshold you control — the difference between helpful enrichment and an embarrassing mis-merge that routes a prospect to the wrong industry sequence. Store the source domain and enriched_at timestamp so RevOps can measure freshness and trace any field back to exactly where it came from. Keep fields that were genuinely not found as null rather than guessing: a confidently wrong employee band is worse for SDR trust than an honest blank, because reps quickly learn to distrust a system that fills fields with plausible-sounding fiction.

The linkedin_url field is intentionally nullable and populated only when the company's own site links to their page — never from scraping LinkedIn member search directly. The employee_band field uses ranges rather than point estimates because public sources rarely agree on a precise headcount, and a band is both more honest and more actionable for routing logic.

enriched lead row

json

{
  "lead_id": "sf_00Q5g00000abc",
  "company_domain": "acme.io",
  "company_name": "Acme IO",
  "industry": "B2B SaaS",
  "employee_band": "51-200",
  "hq_country": "US",
  "public_email": "hello@acme.io",
  "phone": "+1-555-0100",
  "linkedin_url": null,
  "enriched_at": "2026-06-23T09:15:00Z",
  "match_confidence": 0.92,
  "sources": ["https://acme.io/about", "https://acme.io/contact"],
  "email_verified": true,
  "opt_out_checked": true
}
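
As a rough sketch of the "band, not point estimate" and "null, not guess" rules, the helper below maps a headcount to a range and returns None when the headcount is unknown. Only the "51-200" band appears in the example row; the other band edges and the "1001+" label are assumptions chosen for illustration.

```python
from typing import Optional

# Band edges are illustrative assumptions; only "51-200" comes from the example row.
EMPLOYEE_BANDS = [(1, 10), (11, 50), (51, 200), (201, 500), (501, 1000)]


def employee_band(headcount: Optional[int]) -> Optional[str]:
    """Map a headcount to a band; return None when unknown rather than guessing."""
    if headcount is None or headcount < 1:
        return None
    for low, high in EMPLOYEE_BANDS:
        if low <= headcount <= high:
            return f"{low}-{high}"
    return "1001+"
```

Writing None through to the CRM keeps an honest blank distinguishable from a real value in reporting.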

3. Example API request

The css_extractor request below pulls the visible company name, meta description, and any tel: or mailto: links from an about page in a single call, covering the common case cheaply without a full HTML parse. Mode auto keeps a simple marketing site on the fast HTTP lane and only escalates to headless browser rendering when the site's bot protection or JavaScript assembly requires it, so per-lead cost stays predictable at scale. In practice you should also fetch the page as html in a second call and parse the application/ld+json Organization block, because that structured data is usually richer and more stable than the visible DOM. That pattern pairs naturally with parsing in Python, as described in the Beautiful Soup web scraping guide.

Set enable_solver: true when targeting company sites that sit behind Cloudflare or similar challenge pages — the OmniScrape Web Unlocker handles the challenge transparently and returns the page content as normal. For sites that assemble contact details via JavaScript after page load, switch mode to js_rendering and add a js_wait_selector targeting the element that appears once the contact block renders.

company about page — css_extractor

json

POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_KEY

{
  "url": "https://acme.io/about",
  "mode": "auto",
  "output_format": "css_extractor",
  "enable_solver": true,
  "css_selectors": {
    "company_name": "h1",
    "description": "meta[name='description']",
    "phone": "a[href^='tel:']",
    "public_email": "a[href^='mailto:']",
    "address": "[itemprop='address']"
  }
}
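
The same request can be assembled from Python with only the standard library. The sketch below mirrors the endpoint, header, and body fields shown above; the Content-Type header, the helper names, and the rule that a wait selector switches mode to js_rendering are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Optional

API_URL = "https://api.omniscrape.io/v1/scrape"  # endpoint as shown in the request above


def build_scrape_payload(url: str, selectors: dict,
                         wait_selector: Optional[str] = None) -> dict:
    """Build the request body; a wait selector opts into js_rendering."""
    payload = {
        "url": url,
        "mode": "auto",
        "output_format": "css_extractor",
        "enable_solver": True,
        "css_selectors": selectors,
    }
    if wait_selector:
        payload["mode"] = "js_rendering"
        payload["js_wait_selector"] = wait_selector
    return payload


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Wrap the payload in a POST request (Content-Type is an assumption)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen(request, timeout=30) would send it; the response shape is not covered by this sketch.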

4. Pipeline design

A CRM webhook kicks off domain extraction, which fans out into a bounded URL queue covering the homepage, /about, and /contact pages of the target company. OmniScrape fetches each in parallel; a JSON-LD parser extracts the schema.org Organization block while a CSS-field extractor pulls visible firmographic signals. A confidence scorer weighs how well the two sources agree — high agreement lifts the score, contradictions lower it. An email-verification SaaS then validates any scraped address against MX records and mailbox existence before the record is written. Only after verification does the pipeline update the CRM, and a record that clears both a high confidence threshold and a verified email additionally creates an SDR task so reps act while the lead is warm.
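
The text does not give a formula for the confidence scorer, so the following is one possible sketch rather than the actual method: start from a neutral baseline, add a step for every field on which the JSON-LD and CSS extractions agree, and subtract one for every contradiction. The baseline, step size, and 0.85 task threshold are all tunable assumptions.

```python
def _normalize(value) -> str:
    """Lowercase and collapse whitespace so cosmetic differences do not count."""
    return " ".join(str(value).lower().split())


def match_confidence(json_ld: dict, css: dict,
                     baseline: float = 0.5, step: float = 0.15) -> float:
    """Score agreement between the two extractions, clamped to [0, 1]."""
    score = baseline
    for field, ld_value in json_ld.items():
        css_value = css.get(field)
        if ld_value is None or css_value is None:
            continue  # a field missing on one side is neither agreement nor conflict
        if _normalize(ld_value) == _normalize(css_value):
            score += step
        else:
            score -= step
    return round(min(1.0, max(0.0, score)), 2)


def should_create_sdr_task(confidence: float, email_verified: bool,
                           threshold: float = 0.85) -> bool:
    """An SDR task needs both a high-confidence match and a verified email."""
    return email_verified and confidence >= threshold
```

Treating a field that is missing on one side as neutral keeps thin sites from being penalized as if they were contradictory.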

A separate opt-out table suppresses future enrichment for any contact who has exercised a GDPR or CCPA deletion or opt-out request, and the pipeline checks it as the very first step before making any external requests. Because the work per lead is small and bounded (three pages, a parse, a verification call), the system runs comfortably in near real time on standard serverless infrastructure, with no queuing machinery beyond the bounded per-lead URL list. Keeping the source list short and the opt-out check mandatory is what makes the pipeline both fast and legally defensible. Any proposal to add a fourth or fifth source type should go through the same governance review as the initial design.
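
A minimal sketch of the opt-out check as a hard first step might look like this. Storing hashed rather than raw addresses in the suppression table is a design assumption, and enrich_lead and fetch_pages are hypothetical names standing in for the real pipeline.

```python
import hashlib
from typing import Callable


def suppression_key(email: str) -> str:
    """Hash the normalized address so the table need not hold raw emails (assumption)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def enrich_lead(email: str, suppressed: set,
                fetch_pages: Callable[[str], list]) -> dict:
    """Check the suppression table before any external request is made."""
    if suppression_key(email) in suppressed:
        return {"status": "suppressed", "opt_out_checked": True, "pages": []}
    return {"status": "enriched", "opt_out_checked": True,
            "pages": fetch_pages(email)}
```

Because the fetcher is only called after the check passes, a suppressed contact never generates outbound traffic.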

5. What not to scrape

The boundaries here are not stylistic preferences; they are the difference between a sales tool and a lawsuit. Do not scrape LinkedIn member profiles at scale without authorization — their terms of service restrict automated access and scaling profile extraction invites both technical blocks and litigation. Do not buy or ingest breach-derived contact lists, even indirectly through data brokers who obscure their sourcing. Do not bypass login walls on gated directories to extract data that is deliberately not public. The defensible surface is public marketing sites and the schema.org data companies publish about themselves.

Anything beyond that surface needs explicit written sign-off from counsel before a single request is made. When a growth team pushes to widen the net — "can we just pull from this directory?" — the right response is to make them obtain that approval in writing rather than quietly enabling it at the engineering level. Document the source allowlist in your pipeline configuration and treat changes to it as requiring the same review as a new data-processing activity under GDPR Article 30. The pipeline should make it structurally difficult to add sources without that review, not merely discouraged.
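
One way to make the allowlist structural rather than advisory is to validate every URL before it reaches the fetcher. The sketch below permits only the company's own host and the three paths named earlier; the HTTPS-only rule and the www. variant are assumptions.

```python
from urllib.parse import urlsplit

# Paths with the trailing slash stripped; "" is the homepage.
ALLOWED_PATHS = frozenset({"", "/about", "/contact"})


def is_allowed_source(url: str, company_domain: str) -> bool:
    """Accept only company-owned pages on the reviewed path list."""
    parts = urlsplit(url)
    if parts.scheme != "https":  # HTTPS-only is an assumption
        return False
    host = (parts.hostname or "").lower()
    if host not in (company_domain, f"www.{company_domain}"):
        return False  # third-party directories and social sites are rejected
    return parts.path.rstrip("/") in ALLOWED_PATHS
```

Adding a new source then means changing this reviewed constant, which gives the governance step a concrete diff to approve.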

6. The stale directory problem

Third-party directories lag reality badly — companies rebrand, move headquarters, change size bands, and shut down faster than aggregators update their records. A directory entry for a 200-person company may reflect headcount from two funding rounds ago, and routing a deal to an enterprise sequence based on that number wastes both SDR time and sequence budget. Prefer the firm's own about and contact pages as the primary source, and use directories only as a fallback or cross-check when the primary site is thin or unstructured.

Refresh enrichment on a defined cadence — every ninety days at minimum, or whenever a contact re-enters the pipeline as a fresh inbound — so the CRM does not slowly fill with confidently outdated facts. Freshness is a metric RevOps tracks precisely because stale enrichment erodes rep trust quietly: reps start ignoring enriched fields, then stop reporting on them, then the program loses its business case. The enriched_at timestamp in the schema exists specifically so automated jobs can identify records due for a refresh without manual auditing.
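
A refresh job can decide which records are due from enriched_at alone. This sketch assumes the timestamp is stored in the ISO 8601 form used in the example row and that the caller supplies a timezone-aware "now"; the function name is hypothetical.

```python
from datetime import datetime, timedelta

REFRESH_AFTER = timedelta(days=90)  # the minimum cadence named in the text


def is_due_for_refresh(enriched_at: str, now: datetime,
                       reentered: bool = False) -> bool:
    """True when the record is at least 90 days old or the lead re-entered as inbound."""
    if reentered:
        return True
    # Replace the trailing "Z" so fromisoformat works on older Python versions too.
    stamp = datetime.fromisoformat(enriched_at.replace("Z", "+00:00"))
    return now - stamp >= REFRESH_AFTER
```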

7. Metrics to track

Match rate tells you how often enrichment actually adds something, but SDR acceptance rate is the metric that matters most operationally — field coverage is worthless if reps have learned to ignore the data because it has been wrong too many times. Survey reps quarterly and watch whether they override enriched fields manually; a high override rate is a leading indicator that match quality is degrading before bounce rates catch up.

Post-verification bounce rate is the proof that your email-verification handoff is doing its job rather than passing through dead addresses with a false-positive result. GDPR opt-out processing time is simultaneously an operational and a compliance metric: a slow opt-out path is a regulatory risk, not just a backlog item, and regulators treat it accordingly. Track cost per enriched lead at the field level — some fields like employee band may cost significantly more per accurate result than others, and that ratio should inform which fields are worth maintaining.

Match rate on inbound leads (fields filled / leads processed)
Data freshness (days since last enrichment, by field)
Email bounce rate post-verification
GDPR and CCPA opt-out processing time (target: under 72 hours)
Cost per enriched lead (scrape credits + verification API calls)
SDR acceptance rate (do reps trust and act on the enriched data?)
Confidence score distribution (what fraction of records clear your automation threshold?)
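
Three of these metrics can be computed directly from enriched rows, as in the rough sketch below. It interprets match rate as the share of leads where at least one enrichment field was filled, which is one reading of the definition above, and it assumes a hypothetical boolean "bounced" field written back by the sending platform.

```python
# Fields counted as "enrichment added something" (taken from the example row).
ENRICHED_FIELDS = ("industry", "employee_band", "hq_country", "public_email", "phone")


def pipeline_metrics(records: list, threshold: float = 0.85) -> dict:
    """Compute match rate, share above threshold, and post-verification bounce rate."""
    total = len(records)
    if total == 0:
        return {"match_rate": 0.0, "above_threshold": 0.0, "bounce_rate": 0.0}
    matched = sum(
        1 for r in records
        if any(r.get(field) is not None for field in ENRICHED_FIELDS)
    )
    above = sum(1 for r in records if (r.get("match_confidence") or 0.0) >= threshold)
    verified = [r for r in records if r.get("email_verified")]
    bounced = sum(1 for r in verified if r.get("bounced"))  # "bounced" is hypothetical
    return {
        "match_rate": matched / total,
        "above_threshold": above / total,
        "bounce_rate": bounced / len(verified) if verified else 0.0,
    }
```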

8. Obfuscated contacts

A growing number of companies render contact emails as images, split addresses across DOM nodes with CSS reassembly, or construct them with JavaScript specifically to defeat automated extraction. Attempting to defeat those techniques — OCR on email images, DOM reconstruction, JavaScript execution to capture assembled strings — is technically possible but rarely worth the cost, noise, and the signal it sends about your intent. If a company has gone to the trouble of obfuscating their contact details, they have made a deliberate choice about how they want to receive outreach.

The pragmatic and compliant move is to skip obfuscated contacts entirely and route to the company's official contact form instead, which is the channel they have explicitly offered for inbound communication. Record the obfuscation flag in your pipeline so RevOps knows why the public_email field is null for certain records rather than assuming an enrichment failure. Respecting that obfuscation rather than engineering around it keeps the program on the right side of both ethics and terms of service, and avoids the arms-race dynamic that gets entire IP ranges blocked.
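
Recording the obfuscation flag only requires noticing that obfuscation is present, not undoing it. The heuristic below is a sketch under stated assumptions: it recognizes a plain mailto: link, the data-cfemail attribute that Cloudflare's email protection emits, and "name [at] domain" spellings. It cannot see image-rendered addresses, and the status labels are hypothetical.

```python
import re

_MAILTO = re.compile(r"""href\s*=\s*["']mailto:""", re.IGNORECASE)

# Hints that an address is present but deliberately hidden (illustrative, not exhaustive).
_OBFUSCATION_HINTS = (
    re.compile(r"data-cfemail", re.IGNORECASE),
    re.compile(r"\b[\w.+-]+\s*[\[(]\s*at\s*[\])]\s*[\w-]+", re.IGNORECASE),
)


def contact_email_status(html: str) -> str:
    """Classify a page as 'plain', 'obfuscated', or 'absent' without decoding anything."""
    if _MAILTO.search(html):
        return "plain"
    if any(pattern.search(html) for pattern in _OBFUSCATION_HINTS):
        return "obfuscated"
    return "absent"
```

An "obfuscated" result leaves public_email null and routes the lead to the contact form path.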

9. Email verification

A mailto: link scraped from a public page is a candidate address, not a verified contact. Writing unverified addresses to the CRM and triggering sequences against them poisons deliverability, tanks sender reputation with ESPs, and in volume can get your sending domain blacklisted — damage that takes months to recover from and affects every outbound motion, not just enrichment-sourced leads. Always pipe scraped addresses through a verification vendor that checks syntax validity, domain MX record existence, and mailbox-level acceptance before anything lands in Salesforce.

Record the verification result as a boolean field alongside match_confidence so SDR automation can gate on both independently. A high firmographic match with an unverified email should not trigger an automated send; it should create a task for manual review instead. Role-based addresses like info@ or hello@ often pass MX verification but have low deliverability and engagement rates — consider flagging them separately and routing to the contact form path rather than direct outreach. Treating verification as a mandatory pipeline stage rather than an optional polish step is what keeps the outbound machine healthy as enrichment volume scales.
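
The gating rules in this section can be expressed as one small routing function. The sketch assumes a 0.85 threshold and an illustrative list of role-based local parts; the route names are hypothetical labels, not CRM values.

```python
from typing import Optional

# Illustrative role-based local parts (assumption); extend to match your own data.
ROLE_LOCAL_PARTS = {"info", "hello", "contact", "sales", "support", "press"}


def route_lead(email: Optional[str], email_verified: bool,
               match_confidence: float, threshold: float = 0.85) -> str:
    """Gate on verification and confidence independently before any automated send."""
    if email is None:
        return "contact_form"  # nothing usable was published
    local_part = email.split("@", 1)[0].lower()
    if local_part in ROLE_LOCAL_PARTS:
        return "contact_form"  # role inbox: avoid direct sequences
    if not email_verified:
        return "manual_review"  # high match, unverified email: create a task
    if match_confidence < threshold:
        return "manual_review"
    return "auto_sequence"
```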

10. Rollout

Start the program narrow — US-based B2B inbound only, a single SDR team as the consumer — so you can validate match rates and acceptance rates before taking on the heavier compliance load of additional regions or use cases. Instrument everything from day one: match rate by industry vertical, confidence score distribution, email bounce rate, and SDR override frequency. A two-week pilot on a controlled inbound segment gives you enough data to know whether the pipeline is earning trust before you commit to broader infrastructure.

EU and UK expansion should wait on an explicit legal review of lawful basis, retention periods, opt-out handling, and Article 30 documentation, because the compliance requirements there are materially stricter than a US-only pilot exposes you to. Do not treat EU expansion as a configuration change; treat it as a new data-processing activity that requires its own sign-off. Prove the pipeline earns rep trust on a controlled segment first, then widen scope deliberately with counsel in the room rather than flipping on every geography at once. A small enrichment program that reps trust and act on consistently beats a global one they have learned to ignore.

Frequently asked questions

Can I scrape LinkedIn for leads?

LinkedIn's terms of service restrict automated access to member data, and scaling profile scraping there invites both technical countermeasures and litigation. hiQ v. LinkedIn established some public-data precedent, but LinkedIn continues to enforce aggressively and the legal landscape remains unsettled. The compliant path for LinkedIn-sourced data is LinkedIn's own authorized channels, such as Sales Navigator and its approved partner integrations. For enrichment purposes, you can legitimately capture a company's LinkedIn page URL if the company links to it from their own website, but do not scrape member profiles or search results. Treat large-scale LinkedIn member scraping as out of scope rather than an engineering problem to solve.

Is storing public company emails GDPR-compliant?

It depends on your lawful basis, the role of the contact, and the jurisdiction — there is no blanket yes. Business contact data for genuinely corporate roles (info@, press@) sits in a different category from named individual work emails, which are personal data under GDPR regardless of being publicly listed. Processing under legitimate interest is possible but requires a documented balancing test showing the processing does not override the individual's rights. EU and UK leads require a documented lawful basis, defined retention limits, a working opt-out mechanism, and Article 30 records. Involve counsel before enriching EU contacts at scale, and do not assume that "it was publicly posted" is a sufficient basis on its own.

Which pages give the best enrichment signal?

The /about and /contact pages, the homepage footer, and the JSON-LD Organization block in the page head carry the densest firmographic signal on most company sites. The structured JSON-LD is often richer and more stable than the visible DOM — it frequently includes legalName, numberOfEmployees, foundingDate, address, and sameAs links to social profiles — so parse it in addition to CSS fields rather than instead of them. Keep the source list short and company-owned. Third-party directories are useful as a cross-check when the primary site is thin, but they should never be the primary source of truth given how quickly they go stale.
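
For the JSON-LD side, a standard-library sketch is enough to pull the Organization block out of fetched HTML. It matches only the exact "Organization" type (subtypes such as Corporation are ignored here), and it returns a LinkedIn company URL only when the site itself lists one in sameAs. Function names are hypothetical.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_organization(html: str) -> Optional[dict]:
    """Return the first schema.org Organization object, or None if absent."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip rather than guess
        if isinstance(data, dict):
            graph = data.get("@graph")
            candidates = graph if isinstance(graph, list) else [data]
        elif isinstance(data, list):
            candidates = data
        else:
            continue
        for item in candidates:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Organization" in types:
                return item
    return None


def linkedin_company_url(org: dict) -> Optional[str]:
    """Use only a LinkedIn page the company itself links to via sameAs."""
    same_as = org.get("sameAs") or []
    if isinstance(same_as, str):
        same_as = [same_as]
    for link in same_as:
        if isinstance(link, str) and "linkedin.com/company/" in link:
            return link
    return None
```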

How fast should enrichment run?

Aim for completion within sixty seconds of the inbound webhook firing, since SDR response time to fresh leads is a meaningful conversion factor and enrichment that arrives after the rep has already sent a generic opener adds less value. Because the three pages are fetched in parallel at roughly two to three seconds each, the scrape phase takes about as long as the slowest fetch and stays well under thirty seconds even with a retry or a headless-rendering escalation; email verification adds another five to ten seconds depending on the vendor. Keep the per-lead work bounded (a fixed URL list, not an open-ended crawl) so latency stays predictable as inbound volume grows and you do not need to over-provision infrastructure for burst capacity.
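
To keep latency bounded in practice, the fixed URL list can be fetched concurrently with a per-page timeout, so one slow page cannot hold up the whole lead. This is a sketch using asyncio; the fetch coroutine is supplied by the caller, and the ten-second default timeout is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def fetch_all(urls: list, fetch: Callable[[str], Awaitable[str]],
                    per_page_timeout: float = 10.0) -> dict:
    """Fetch every URL concurrently; a page that times out maps to None."""

    async def guarded(url: str) -> tuple:
        try:
            body: Optional[str] = await asyncio.wait_for(
                fetch(url), timeout=per_page_timeout
            )
        except asyncio.TimeoutError:
            body = None  # treat as not found; do not block the other pages
        return url, body

    results = await asyncio.gather(*(guarded(url) for url in urls))
    return dict(results)
```

Total wall time is then roughly the slowest successful fetch or the timeout, whichever comes first, instead of the sum of all three.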

Why attach a match_confidence score?

Because downstream automation should never auto-enroll or auto-route on a weak match. A confidence score lets you gate automated sends above a threshold — 0.85 is a reasonable starting point — and queue lower-confidence records for human review rather than silently mis-routing them. It also gives RevOps a tunable lever: raise the threshold to improve precision at the cost of coverage, lower it to increase coverage while accepting more manual review. Confidence turns enrichment from a blunt overwrite into a controllable input with observable quality characteristics. Track the distribution over time; a drifting distribution is an early signal that a source is degrading before bounce rates catch up.

How should I handle companies that use JavaScript to render their contact pages?

Switch the OmniScrape request to mode js_rendering and add a js_wait_selector targeting a DOM element that appears only after the contact block has fully rendered — for example, a[href^='mailto:'] or a specific section class. This tells the headless browser to wait for that element before returning the page, avoiding the common failure mode of extracting an empty contact section from a partially rendered page. js_rendering costs more per request than auto's fast lane, so apply it selectively: use mode auto first, inspect metadata.method_used in the response, and only hard-code js_rendering for domains you have confirmed require it.

What should I do when enrichment returns null for most fields?

First distinguish between a scraping failure and a genuinely thin site. Check the response body.success field and metadata.method_used to confirm the page was actually fetched and rendered. If the page loaded but fields are null, the company may use non-standard markup — inspect the raw HTML returned in body.data.content and adjust your css_selectors accordingly. If the site is genuinely sparse, fall back to the JSON-LD Organization block, which is often populated even when the visible page is minimal. As a last resort, route the lead to manual SDR research rather than writing null values that look like enrichment failures in reporting.

Ready to scrape without blocks?

Get your API key in minutes. Test protected URLs from the dashboard — no credit card required to start.