1. Industry workflow: freshness over completeness
The crawl cadence is split by source priority. High-value company career sites (Greenhouse, Lever, and direct careers subdomains) are polled hourly, so a new requisition surfaces in the product within about an hour of going live. Aggregator category pages are crawled daily to sweep up the long tail without burning request budget on low-signal sources. From each posting the pipeline extracts title, location, employment type, posted date, and the full description text, then hashes the normalized description so the same role surfacing on a second site can be recognized and collapsed as a duplicate.
Lifecycle management is what keeps the index honest. A posting not seen in 14 days is marked stale; one absent for 21 days is removed from the customer-facing API unless a direct verification fetch confirms it is still live. This TTL discipline matters because labor-market products are judged on freshness — a customer who sees a 'new' posting that was actually filled last month stops trusting the entire dataset. An aggressive expiry policy that occasionally removes a live posting beats a comprehensive but stale index every time, because false negatives are invisible while false positives erode trust visibly.
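The expiry rule above can be sketched as a small state function. This is a minimal illustration, not the pipeline's actual code: the function name, the `verified_live` argument, and the choice to treat a verified posting as active again are assumptions; only the 14-day and 21-day thresholds come from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(days=14)   # from the text: unseen for 14 days -> stale
REMOVE_AFTER = timedelta(days=21)  # from the text: unseen for 21 days -> removed


def lifecycle_state(last_seen: datetime, now: datetime,
                    verified_live: Optional[bool] = None) -> str:
    """Return 'active', 'stale', or 'removed' for a posting.

    verified_live is the result of a direct verification fetch, or None
    if no such fetch has been made.
    """
    unseen_for = now - last_seen
    if unseen_for >= REMOVE_AFTER:
        # Assumption: a successful verification fetch counts as seeing the
        # posting again, so it goes back to active instead of being removed.
        return "active" if verified_live else "removed"
    if unseen_for >= STALE_AFTER:
        return "stale"
    return "active"


now = datetime(2026, 6, 23, tzinfo=timezone.utc)
state = lifecycle_state(now - timedelta(days=15), now)
```

Keeping the thresholds as named constants makes the expiry policy easy to audit and to tune per source.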
Seed management is the third pillar. Maintain a curated list of company ATS base URLs rather than discovering them dynamically, so the crawler knows exactly which Greenhouse subdomain belongs to which company and can attribute postings correctly from the start. New companies are added to the seed list as a deliberate editorial decision, not through opportunistic link-following, which keeps the coverage boundary auditable.
2. Posting data schema
The row centers on two identifiers: a stable job_id pulled from the board's URL when the platform exposes one, and a description_hash for cross-aggregator deduplication. Greenhouse and Lever embed clean posting IDs in their URL paths (numeric on Greenhouse, UUID-style on Lever), making them ideal canonical anchors. Messier boards fall back to a deterministic hash of company, title, location, and posted date, which is slightly less reliable across reposts but workable. Keep salary_raw as the original string rather than a parsed range: compensation formats vary wildly across geographies and industries, and premature parsing destroys information you will want later when building salary-band analytics.
The is_active flag is updated on every verification pass, not just on ingestion. Storing scraped_at separately from posted_at lets you compute crawl lag — the delta between when a posting went live and when your pipeline first captured it — which is the most honest measure of how fresh your index actually is. Store description_text in full in object storage and only a truncated version in the primary database; NLP reprocessing runs against the full text, not the truncated copy.
{
"job_id": "greenhouse_442981",
"source": "company_careers",
"ats_platform": "greenhouse",
"company_name": "Acme Analytics",
"company_domain": "acme.com",
"title": "Senior Data Engineer",
"location": "Remote - US",
"employment_type": "full_time",
"posted_at": "2026-06-20",
"description_text": "...",
"description_hash": "sha256:abc123...",
"salary_raw": "$160k–$190k",
"url": "https://careers.acme.com/jobs/442981",
"canonical_url": "https://boards.greenhouse.io/acme/jobs/442981",
"scraped_at": "2026-06-23T08:00:00Z",
"crawl_lag_minutes": 47,
"is_active": true,
"skills_extracted": ["Python", "Apache Spark", "dbt", "Kubernetes"],
"seniority": "senior",
"role_family": "data_engineering"
}
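Crawl lag, as described above, is the difference between publication time and first capture. A minimal sketch of that calculation follows; the function name is hypothetical, and it assumes both values are full ISO 8601 timestamps with a timezone. A date-only posted_at can only give day-level resolution.

```python
from datetime import datetime


def crawl_lag_minutes(posted_at: str, first_scraped_at: str) -> int:
    """Whole minutes between publication and first capture (ISO 8601 inputs)."""
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    posted = datetime.fromisoformat(posted_at.replace("Z", "+00:00"))
    scraped = datetime.fromisoformat(first_scraped_at.replace("Z", "+00:00"))
    return int((scraped - posted).total_seconds() // 60)


lag = crawl_lag_minutes("2026-06-23T07:13:00Z", "2026-06-23T08:00:00Z")
```

Compute the value once, at first capture, and store it; later verification passes update scraped_at-style timestamps and would otherwise inflate the lag.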
3. OmniScrape API request for career page extraction
The css_extractor output format handles the common case cleanly. Most applicant tracking systems use predictable, stable class names on their hosted job detail pages, and pinning a js_wait_selector to the description container ensures the full text has rendered before extraction begins. Greenhouse and Lever pages are server-rendered, so mode auto keeps them on the fast HTTP lane and the per-request cost low. Log metadata.method_used per source domain — when a source that previously resolved via fast starts returning js_rendering, it is an early signal that the site has added client-side rendering or a bot-detection layer that warrants investigation.
Workday is the notable exception to the server-rendered norm. Its portals are heavily client-rendered and often require mode js_rendering with a js_wait_selector pointed at the job description container, plus a residential proxy matched to the job's country. Budget for a higher per-request cost on Workday sources and track it separately so it does not distort your blended cost-per-posting metric. The response HTML is in body.data.content for html output format and structured fields are in body.data.css_extracted for css_extractor.
POST https://api.omniscrape.io/v1/scrape
X-API-Key: YOUR_API_KEY
Content-Type: application/json
{
"url": "https://careers.acme.com/jobs/442981",
"mode": "auto",
"output_format": "css_extractor",
"css_selectors": {
"title": "h1.job-title",
"location": ".job-location",
"posted_at": "time[datetime]",
"employment_type": ".employment-type",
"description": ".job-description",
"salary_raw": ".salary-range",
"apply_url": "a.apply-button"
},
"js_wait_selector": ".job-description",
"proxy": "residential:us",
"enable_solver": true
}
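For workers written in Python, the same request could be assembled with the standard library. This is a sketch under stated assumptions: the endpoint, header, and field names are copied from the request above, the helper name build_scrape_request is hypothetical, and the selector set is trimmed for brevity. The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.omniscrape.io/v1/scrape"  # endpoint from the request above


def build_scrape_request(posting_url: str, api_key: str,
                         proxy: str = "residential:us") -> urllib.request.Request:
    """Build (but do not send) a css_extractor request for one posting."""
    payload = {
        "url": posting_url,
        "mode": "auto",
        "output_format": "css_extractor",
        "css_selectors": {
            "title": "h1.job-title",
            "location": ".job-location",
            "description": ".job-description",
        },
        "js_wait_selector": ".job-description",
        "proxy": proxy,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://careers.acme.com/jobs/442981", "YOUR_API_KEY")
# Sending is left to the caller, for example:
#   with urllib.request.urlopen(req, timeout=30) as resp: ...
```

Separating request construction from sending keeps the payload easy to unit test and lets the worker write the raw response to object storage before any parsing.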
4. End-to-end pipeline architecture
The flow starts from a curated seed list of company ATS base URLs, expands into board URL patterns per platform, and runs a polite paginated crawl of listing pages that enqueues individual posting detail URLs. Worker processes call OmniScrape to extract each posting via css_extractor, writing raw responses to object storage before any transformation. The dedup stage collapses on job_id plus description_hash, a spaCy-based skills tagger enriches the description text, and the result lands in an Elasticsearch index fronted by a GraphQL API for customers. Orchestration on Airflow keeps the hourly career-site DAG and the daily aggregator DAG separate so a slow aggregator crawl never delays the high-priority refresh.
Salary behind authentication is a hard boundary, not an optimization target. The pipeline takes public salary ranges where they are displayed in the DOM and skips the field otherwise — it does not attempt to bypass an apply-login to harvest hidden compensation, because that crosses into unauthorized access regardless of technical feasibility. Raw description_text is retained in object storage so that when the skills taxonomy or NLP model improves, the entire historical corpus can be re-enriched without re-scraping a single page. This retain-raw discipline is the single most valuable architectural decision a pipeline can make in its first week.
Error handling deserves explicit design. Distinguish between a transient 429 (back off and retry with jitter), a 404 on a posting you previously saw (mark inactive, do not retry), and a structural parse failure where css_extractor returns empty fields (alert and route to a manual selector review queue). Conflating these three failure modes into a single retry loop is how pipelines accumulate silent data gaps that only surface when a customer asks why a specific company's postings disappeared.
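One way to keep the three failure modes apart is to classify every fetch outcome before any retry logic runs. The sketch below is illustrative only: the Action names, the SKIP bucket for other non-200 statuses, and the backoff parameters are assumptions, not part of the pipeline described here.

```python
import random
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    MARK_INACTIVE = "mark_inactive"
    SELECTOR_REVIEW = "selector_review"
    SKIP = "skip"
    STORE = "store"


def classify(status: int, fields: dict, seen_before: bool) -> Action:
    """Map one fetch outcome to exactly one handling path."""
    if status == 429:
        return Action.RETRY_WITH_BACKOFF   # transient: back off and retry
    if status == 404 and seen_before:
        return Action.MARK_INACTIVE        # posting closed: do not retry
    if status != 200:
        return Action.SKIP                 # assumption: logged and handled elsewhere
    if not any(fields.values()):
        return Action.SELECTOR_REVIEW      # structural parse failure
    return Action.STORE


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Because each outcome maps to a single action, an empty extraction can never fall into the retry loop and silently disappear.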
5. Cross-aggregator deduplication
The same Senior Data Engineer role on LinkedIn, Indeed, and the company's own careers page shares a description_hash but carries three different URLs. Resolving that to a single canonical posting is the core dedup problem. The rule that works in practice: prefer the company career page as canonical whenever it exists, since it is the authoritative source and the least likely to vanish behind an aggregator's access restrictions or ToS change. When no direct career page is found, prefer the aggregator with the most complete structured data.
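The canonical-selection rule can be expressed in a few lines. This sketch assumes records shaped like the schema example earlier (a source field with the value company_careers for direct career pages) and measures "most complete structured data" as a simple count of non-empty fields, which is an assumption made for illustration.

```python
STRUCTURED_FIELDS = ("title", "location", "employment_type", "posted_at", "salary_raw")


def completeness(record: dict) -> int:
    """Count how many structured fields are present and non-empty."""
    return sum(1 for field in STRUCTURED_FIELDS if record.get(field))


def pick_canonical(duplicates: list[dict]) -> dict:
    """Choose the canonical record among postings judged to be the same role."""
    company_pages = [r for r in duplicates if r.get("source") == "company_careers"]
    # Prefer the company career page when one exists; otherwise fall back
    # to whichever aggregator record carries the most structured data.
    pool = company_pages or duplicates
    return max(pool, key=completeness)
```

The function expects a non-empty list; the non-canonical records are best kept as alternate URLs on the canonical row rather than discarded.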
Hashing the description alone is insufficient — aggregators sometimes append their own boilerplate, truncate the text at different lengths, or reformat whitespace. Normalize before hashing: strip leading and trailing whitespace, collapse internal whitespace runs, lowercase the entire string, and strip known aggregator footer patterns. Keep a list of those footer patterns as a maintained artifact, because aggregators update their templates and a footer that was stripped last year may have changed form.
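The normalization steps above might look like the following. The footer pattern shown is a made-up placeholder; the real list is the maintained artifact described in the text. The sha256: prefix mirrors the schema example. Note that this sketch does not address aggregators that truncate the description at different lengths.

```python
import hashlib
import re

# Hypothetical footer patterns, written against already-lowercased text.
FOOTER_PATTERNS = [
    re.compile(r"\s*apply now on \w+\.?$"),
]


def normalize_description(text: str) -> str:
    """Strip, lowercase, collapse whitespace, then remove known footers."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    for pattern in FOOTER_PATTERNS:
        text = pattern.sub("", text).strip()
    return text


def description_hash(text: str) -> str:
    """SHA-256 of the normalized description, prefixed like the schema example."""
    digest = hashlib.sha256(normalize_description(text).encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


same = description_hash("  Build   data pipelines. ") == description_hash(
    "Build data pipelines.\n\nApply now on ExampleBoard."
)
```

Version the footer list together with the hash function: changing either one changes every hash, so a rehash of stored raw text is needed after an update.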
Track a false-duplicate rate from a weekly manual sample of 50–100 postings flagged as duplicates. An over-aggressive normalizer that merges two genuinely different roles — a company that posted two separate Senior Data Engineer openings on the same day in different cities — is harder to detect than one that misses a duplicate, and the downstream effect on hiring-velocity metrics is worse. Set a target false-duplicate rate below 0.5% and treat exceedances as pipeline bugs.
6. Deep pagination and incremental crawling
Aggregator search results paginate far deeper than is useful, and crawling to page 50 wastes request budget on stale, low-relevance postings that were already captured in a previous run. Cap pages per query at a sensible depth — typically 5 to 10 pages — and lean on posted_since URL parameters when the board supports them, so an hourly run fetches only what changed rather than re-walking the entire result set. Most major boards expose a date filter; document which ones do and which do not, because the crawl strategy differs significantly.
Where a board exposes no date filter, sort results newest-first and use posting dates as a stopping signal. Once a page contains postings older than the timestamp of your last successful crawl for that query, there is nothing new beyond that point and you can stop paginating. This incremental approach keeps the request count proportional to genuinely new postings rather than total catalog size, which matters enormously when you are running hundreds of queries across dozens of sources.
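A minimal sketch of that stopping rule, assuming results come back newest-first and each posting carries a posted_at datetime. The fetch_page callable is a hypothetical stand-in for whatever fetches and parses one listing page.

```python
from datetime import datetime
from typing import Callable


def crawl_new_postings(fetch_page: Callable[[int], list[dict]],
                       last_crawl: datetime,
                       max_pages: int = 10) -> list[dict]:
    """Collect postings newer than last_crawl, stopping at the first older one."""
    new_postings: list[dict] = []
    for page in range(1, max_pages + 1):   # hard cap on page depth
        postings = fetch_page(page)
        if not postings:
            break                           # ran out of results
        fresh = [p for p in postings if p["posted_at"] > last_crawl]
        new_postings.extend(fresh)
        if len(fresh) < len(postings):
            break                           # reached already-crawled territory
    return new_postings
```

The max_pages cap still applies as a safety net, so a board that returns results in an unexpected order cannot trigger an unbounded crawl.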
For listing pages that load results via infinite scroll rather than numbered pages, use mode js_rendering with a js_wait_selector targeting the last listing card, then trigger scroll events via the js_wait_timeout to load additional batches. Cap the scroll depth the same way you would cap page depth — the goal is fresh coverage, not exhaustive re-crawl.
7. Operational metrics to instrument
For trading-signal and recruiting intelligence products, time-to-index is the headline metric — a hiring signal that arrives a day late has already been acted on by competitors, so it has near-zero marginal value. Track it at p50 and p95 separately; a median of 20 minutes with a p95 of 8 hours means a meaningful fraction of postings are arriving too late to be useful, even if the median looks healthy.
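Reporting both percentiles takes very little code. The sketch below uses Python's statistics.quantiles with the inclusive method; the function name and the choice of interpolation method are assumptions, and a production system would more likely compute this in the metrics store.

```python
import statistics


def p50_p95(minutes: list[float]) -> tuple[float, float]:
    """Median and 95th percentile of time-to-index samples, in minutes."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    # It needs at least two samples.
    cuts = statistics.quantiles(minutes, n=100, method="inclusive")
    return cuts[49], cuts[94]


samples = [12, 15, 18, 20, 22, 25, 31, 40, 95, 480]
p50, p95 = p50_p95(samples)
```

With a sample like this one, a handful of slow postings barely moves the median but dominates the p95, which is exactly the gap the text warns about.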
The stale job rate is the trust metric. When postings flagged active start returning 404 on verification, the expiry logic is lagging reality and customers will notice before your dashboards do. Wire an alert to it at a threshold of 2–3% rather than reviewing it in a weekly report — by the time it shows up in a weekly review, a customer has likely already complained. The cost-per-posting metric broken out by platform is what tells you whether Workday's js_rendering overhead is worth the coverage it provides versus the marginal cost of excluding it.
- Time-to-index new postings (p50 and p95 minutes from publish to API availability)
- Crawl lag by source domain (p50 and p95 separately)
- False duplicate rate (weekly manual sample, target below 0.5%)
- Geographic coverage by metro (postings per city vs. target coverage list)
- Source reliability score (uptime × extract success rate per domain)
- Stale job rate (is_active true but HTTP 404 on verification fetch)
- Cost per 1,000 postings ingested (broken out by ATS platform)
- NLP enrichment coverage (% of postings with at least one skill extracted)
8. NLP enrichment pipeline
The description text is where the product value concentrates. The enrichment layer extracts skills, seniority, and role family using spaCy with a custom NER model and a taxonomy that maps to standards like SOC codes and O*NET occupational categories. Mapping surface variation — 'proficient in Python', 'Python 3.x experience', 'strong Python background' — to a single canonical skill entity is what lets customers query demand trends without drowning in lexical noise. Maintain the taxonomy as a versioned artifact with a changelog, because adding a new skill or renaming a category is a breaking change to any customer query that references it by name.
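To make the alias-to-canonical idea concrete, here is a deliberately simple regex stand-in. It is not the spaCy NER model described above, and the alias table is a hypothetical two-entry excerpt; it only illustrates how several surface forms collapse onto one canonical skill name.

```python
import re

# Hypothetical excerpt of a versioned taxonomy: canonical name -> alias patterns.
SKILL_ALIASES = {
    "Python": [r"python(?:\s*3(?:\.x)?)?"],
    "Apache Spark": [r"(?:apache\s+)?spark", r"pyspark"],
}

_COMPILED = {
    skill: [re.compile(rf"\b{pattern}\b", re.IGNORECASE) for pattern in patterns]
    for skill, patterns in SKILL_ALIASES.items()
}


def extract_skills(text: str) -> list[str]:
    """Return the sorted canonical skills whose aliases appear in the text."""
    return sorted(
        skill
        for skill, patterns in _COMPILED.items()
        if any(p.search(text) for p in patterns)
    )


skills = extract_skills("Strong Python background; PySpark a plus.")
```

Whatever extractor is used, emitting only canonical names keeps customer queries stable when new aliases are added.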
Seniority classification deserves its own model rather than a keyword list. Title-based heuristics ('Senior', 'Staff', 'Principal') miss the substantial fraction of postings where seniority is expressed only in the requirements section — years of experience, scope of responsibility, reporting structure. A classifier trained on labeled postings outperforms keyword matching significantly on this signal, and seniority is one of the dimensions customers filter on most heavily.
Always retain description_text in full alongside the enriched fields. Models and taxonomies improve, and you will want to reprocess the historical corpus rather than re-scrape it when you upgrade the tagger. The teams that discard raw text after extraction end up re-crawling years of postings every time they improve the NLP layer — slow, expensive, and an unnecessary load on source sites. The same retain-raw discipline appears in market research scraping, where reprocessable source text is what lets one corpus serve many evolving analytical questions.
9. Politeness, rate limits, and proxy strategy
Company career sites are small infrastructure relative to a search engine. A single company's Greenhouse instance can be degraded or provoked into IP-blocking by an aggressive crawler far more easily than a major aggregator would be. Hold per-domain request rates to one or two requests per second with randomized jitter, respect Retry-After headers on 429 responses, and honor robots.txt directives. Burning goodwill with the companies whose data you depend on is a self-inflicted wound that no proxy rotation fully repairs — if a company's IT team identifies your crawler and blocks your IP ranges, you lose that source until you can establish a new access pattern.
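The pacing rules above could be sketched as two helpers: one that spaces requests with jitter, and one that interprets a Retry-After header, which may be either a number of seconds or an HTTP date. The function names and the jitter fraction are assumptions for illustration.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def polite_delay(rate_per_sec: float = 1.0, jitter: float = 0.5) -> float:
    """Seconds to wait before the next request to the same domain."""
    base = 1.0 / rate_per_sec
    # Jitter only lengthens the wait, so the configured rate is never exceeded.
    return base + random.uniform(0.0, jitter * base)


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header given as delta-seconds or an HTTP date."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None                      # unparseable: caller uses its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

When a Retry-After value is present, wait at least that long; the jittered delay is the floor for normal traffic, not a substitute for the server's own instruction.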
When a source starts returning blocks despite polite pacing, the techniques in web scraping without getting blocked — residential proxies matched to the job posting's country, consistent browser-like headers, session reuse across requests to the same domain, and realistic inter-request timing — usually restore access without needing to escalate the request rate. Use the proxy field in the OmniScrape API to specify residential proxies for sources that have shown sensitivity; the goal is to look like ordinary job-seeker traffic, not to out-muscle the site's defenses.
Maintain a per-domain health log that records block rate, extraction success rate, and method_used distribution over time. A source that was resolving cleanly on fast for months and suddenly starts requiring js_rendering has changed its rendering approach or added bot detection — catch it in the health log before it silently degrades your data quality.
10. Legal governance and data provenance
LinkedIn and several large aggregators explicitly restrict automated scraping in their terms of service, and litigation in this space is active enough that 'the data is public' is not a defense your legal team will accept. Center the pipeline on public company career pages — Greenhouse, Lever, and direct careers subdomains — and licensed aggregator feeds, which are both more legally defensible and generally cleaner to parse than fighting a hostile platform's countermeasures.
Document which sources are scraped, which are licensed, and which are excluded by policy in a maintained data provenance register. Make it auditable: each source entry should record when it was added, who approved it, and the legal basis for inclusion. A clear governance boundary also prevents engineers from quietly adding a restricted source under deadline pressure, which is how most compliance incidents actually start. When a customer or counsel asks about the provenance of a specific data point, you want to answer in minutes from a document, not in days from a forensic git log review.
Personal data handling is a separate concern from ToS compliance. Job postings are about roles, not individuals, but recruiter contact details and applicant-facing personal information that sometimes appears in postings may be subject to GDPR or CCPA depending on your jurisdiction and customer base. Establish a policy on whether recruiter contact fields are stored, how long they are retained, and whether they are surfaced in the customer API — and apply it consistently from the first ingestion rather than retrofitting it after the corpus has grown.
Frequently asked questions
How do I construct a stable job_id when the URL contains no obvious identifier?
Prefer the numeric or alphanumeric ID embedded in the URL path, which Greenhouse, Lever, and most modern ATS platforms expose cleanly — for example, /jobs/442981 yields greenhouse_442981 as a stable, human-readable identifier. When no such ID exists, construct a deterministic hash from the tuple of company domain, normalized title, normalized location, and posted date. Accept that this fallback is slightly less reliable across reposts of the same role, and track the rate at which hash-based IDs collide with genuine new postings versus true reposts as a quality metric.
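A sketch of that two-step rule follows. The URL pattern, the hash_ prefix, and the 16-character truncation are illustrative assumptions; real boards need per-platform patterns.

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical pattern: an ID as the final path segment after /job/ or /jobs/.
_ID_IN_PATH = re.compile(r"/jobs?/([A-Za-z0-9-]+)/?$")


def _norm(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", value.strip().lower())


def build_job_id(platform: str, url: str, company_domain: str,
                 title: str, location: str, posted_date: str) -> str:
    """Prefer the ID embedded in the URL; otherwise hash the identifying tuple."""
    match = _ID_IN_PATH.search(urlparse(url).path)
    if match:
        return f"{platform}_{match.group(1)}"
    key = "|".join([_norm(company_domain), _norm(title), _norm(location), posted_date])
    return "hash_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


job_id = build_job_id(
    "greenhouse", "https://boards.greenhouse.io/acme/jobs/442981",
    "acme.com", "Senior Data Engineer", "Remote - US", "2026-06-20",
)
```

Prefixing fallback IDs differently from URL-derived ones makes it easy to measure what share of the index relies on the weaker identifier.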
Workday portals keep blocking my requests — what is the right approach?
Start with mode auto and a residential proxy matched to the job posting's country, which clears most Workday bot challenges. Set enable_solver to true and add a js_wait_selector targeting the job description container, since Workday pages are client-rendered and the description loads asynchronously. If blocks persist despite these settings, the realistic options narrow to a browser automation session that mimics genuine user behavior or pursuing an official Workday data partnership. Workday is among the harder ATS platforms to scrape at scale, and the cost-per-posting on Workday sources will be meaningfully higher than on Greenhouse or Lever — factor that into your coverage decisions.
Should I scrape large aggregators directly?
Review each aggregator's terms of service and get explicit legal sign-off before adding them as a source. Large aggregators actively litigate unauthorized scraping, and the 'publicly accessible data' argument has had mixed outcomes in court. Many HR-tech products achieve broad coverage by combining company-direct career pages — which are generally scrape-permissive — with licensed data feeds from aggregators who offer them commercially, avoiding the legal exposure of direct scraping while still getting the long-tail coverage that aggregators provide.
How should I handle expired and filled job postings?
Re-fetch the posting URL on a weekly verification cycle and update is_active to false when the target returns HTTP 404 or redirects to a generic careers landing page — do not delete the record. Keeping inactive records with their full history preserves the data that hiring-velocity, time-to-fill, and skill-demand-trend charts depend on. Distinguish between a 404 (definitively closed) and a 503 (source temporarily unavailable) in your verification logic, because conflating them will mark live postings as inactive during source outages.
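The verification rule can be reduced to a function of the HTTP status and the final URL after redirects. In this sketch the set of generic landing paths is a hypothetical example, and any status other than 200 or 404 leaves the flag unchanged.

```python
from urllib.parse import urlparse

# Hypothetical paths that indicate a redirect to a generic careers landing page.
GENERIC_LANDING_PATHS = {"", "/careers", "/jobs"}


def verify_is_active(current: bool, status: int, final_url: str) -> bool:
    """Return the new is_active value after one verification fetch."""
    if status == 404:
        return False                      # definitively closed
    if status == 200:
        path = urlparse(final_url).path.rstrip("/")
        # A 200 on a generic landing page means the posting itself is gone.
        return path not in GENERIC_LANDING_PATHS
    return current                        # 503 and similar: source outage, no change
```

The record itself is never deleted here; only the flag changes, which preserves the history that trend charts depend on.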
Can css_extractor reliably capture full job descriptions?
Yes, for most ATS-hosted pages. Greenhouse, Lever, and iCIMS use stable, predictable class names on their hosted job detail pages, and css_extractor with a js_wait_selector on the description container captures the full text reliably. Truncate at around 50,000 characters in the primary database for query performance and store the complete text in object storage for NLP reprocessing. The main failure mode is a page that loads description content via a secondary API call after initial render — in those cases, increase js_wait_timeout or switch to a js_wait_selector that targets the fully-loaded container rather than its parent.
How do I handle salary data responsibly when it is not publicly displayed?
Take public salary ranges exactly as displayed in the DOM and store them as salary_raw strings without attempting to parse or normalize at ingestion time. Do not attempt to access salary data behind authentication or an apply flow — that crosses into unauthorized access regardless of how the data is ultimately used. For markets where salary disclosure is legally required (several US states now mandate it), coverage will naturally improve as more postings include the field. Build your salary analytics on the subset of postings where the field is present and be explicit with customers about coverage rates rather than implying completeness.
What is the right way to version and update the NLP skills taxonomy?
Treat the taxonomy as a versioned artifact with semantic versioning and a changelog, stored in version control alongside the pipeline code. Patch versions add new skill aliases to existing canonical entries. Minor versions add new canonical skill entries. Major versions rename or restructure canonical entries, which are breaking changes for any customer query referencing skill names. When you release a major version, reprocess the historical corpus against the new taxonomy before making the new skill fields available in the customer API, so customers see a consistent history rather than a discontinuity at the upgrade date.
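The bump rules above can be checked mechanically by diffing two taxonomy snapshots. This sketch assumes a taxonomy is a mapping from canonical skill name to a set of aliases; treating a removed alias as a major change is an added assumption, since the text does not cover that case.

```python
def classify_bump(old: dict[str, set[str]], new: dict[str, set[str]]) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for a taxonomy change."""
    old_names, new_names = set(old), set(new)
    if not old_names <= new_names:
        return "major"   # a canonical entry was renamed or removed
    if any(not old[name] <= new[name] for name in old_names):
        return "major"   # assumption: dropping an alias is treated as breaking
    if new_names - old_names:
        return "minor"   # new canonical entries
    if any(new[name] - old[name] for name in old_names):
        return "patch"   # new aliases on existing entries
    return "none"


def next_version(version: str, bump: str) -> str:
    """Apply a semantic-version bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version
```

Running a check like this in CI keeps the version number honest, so a rename cannot ship as a patch release by accident.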