1. Indeed job fields worth extracting
Before writing a single selector, decide which fields your pipeline actually needs. Indeed exposes a rich set of structured and semi-structured data on each posting, but not every field is present on every listing. Salary disclosure in particular is sparse: US employers are required to disclose pay in only a handful of states, so treat a missing salary field as expected behaviour rather than a scraping error.
Workforce analytics teams typically prioritise posting velocity by skill cluster and metro area, using the posting date and location fields as the primary dimensions. Compensation teams focus on the salary range text when present, normalising the free-text strings ("$120,000 - $150,000 a year", "$65 an hour") into a canonical min/max/period schema in their ETL. Recruiting intelligence products care most about the apply destination — whether a role routes to Indeed Easy Apply or an external ATS — because that determines downstream conversion tracking.
- Job key (jk), the Indeed identifier that stays stable for the lifetime of a posting; a reposted role may receive a new jk
- Job title as displayed and your ETL-normalised title
- Company name and employer star rating when rendered
- Location string, remote/hybrid label, and job type (full-time, part-time, contract, internship)
- Salary range text when disclosed (e.g. "$120,000 - $150,000 a year")
- Posting date (relative label on search cards, ISO date in structured data on viewjob)
- Full job description HTML — the richest signal for skill extraction
- Apply link destination: Indeed Easy Apply vs external employer ATS URL
- Sponsored vs organic flag on search result cards
- Employer benefits snippet when present
- Number of applicants indicator when shown ("Over 200 applicants")
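The salary normalisation mentioned above can be sketched roughly as follows. This is a minimal illustration, not a complete parser: it assumes US-dollar strings in the two shapes quoted in this section, and the names parse_salary and Salary are hypothetical.

```python
import re
from typing import Optional, TypedDict


class Salary(TypedDict):
    min: float
    max: float
    period: str


# Dollar amounts such as "$120,000" or "$65.50" (assumption: USD only).
_AMOUNT = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")
# Pay period phrases such as "a year", "an hour", "per week".
_PERIOD = re.compile(r"\b(?:an?|per)\s+(hour|day|week|month|year)\b", re.IGNORECASE)


def parse_salary(text: str) -> Optional[Salary]:
    """Turn a free-text salary string into a min/max/period record, or None."""
    amounts = [float(raw.replace(",", "")) for raw in _AMOUNT.findall(text)]
    period = _PERIOD.search(text)
    if not amounts or period is None:
        return None  # missing or unrecognised salary text is expected, not an error
    return {"min": min(amounts), "max": max(amounts), "period": period.group(1).lower()}
```

A single figure such as "$65 an hour" yields the same value for min and max, which keeps the downstream schema uniform. Strings in other currencies or formats return None and would need their own patterns.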
2. Indeed URL patterns and pagination mechanics
Indeed has two primary URL surfaces worth scraping: the viewjob detail page and the search results page. They differ significantly in bot-detection aggressiveness and in the data they expose.
The viewjob URL is the canonical, durable artifact. It is keyed on the jk parameter, which is an alphanumeric identifier assigned by Indeed when a listing is indexed. This URL remains stable for the lifetime of the posting and is the correct target for detail extraction. Search URLs, by contrast, encode transient query state and are significantly more likely to trigger Cloudflare challenges at scale — treat them as a discovery mechanism to harvest jk values, not as your primary scrape target.
For country-specific catalogs, each Indeed TLD (indeed.co.uk, indeed.de, indeed.com.au) maintains its own independent job index. Job keys do not transfer across TLDs. When scraping non-US markets, use a proxy egress point in the target country and target the matching TLD.
- Job detail: https://www.indeed.com/viewjob?jk=a1b2c3d4e5f6g7h8
- Search page 1: https://www.indeed.com/jobs?q=machine+learning+engineer&l=San+Francisco%2C+CA&start=0
- Search page 2: increment start by 10 — &start=10, &start=20, etc.
- Remote filter: append &remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11
- Date filter: &fromage=1 (last 24 h), &fromage=3, &fromage=7, &fromage=14
- Salary search: https://www.indeed.com/career/salaries/data-scientist/San-Francisco--CA
- UK catalog: https://www.indeed.co.uk/jobs?q=software+engineer&l=London
- DE catalog: https://www.indeed.de/jobs?q=python+entwickler&l=Berlin
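The URL patterns listed above can be generated with a small helper. This is a sketch using only the parameters shown in the list; the function names search_url and viewjob_url are illustrative, and the host argument is assumed to be whichever country TLD you are targeting.

```python
from typing import Optional
from urllib.parse import urlencode


def search_url(query: str, location: str, page: int = 0,
               fromage: Optional[int] = None, host: str = "www.indeed.com") -> str:
    """Build a search URL; page 0 maps to start=0, page 1 to start=10, and so on."""
    params = {"q": query, "l": location, "start": page * 10}
    if fromage is not None:
        params["fromage"] = fromage  # e.g. 1, 3, 7 or 14 days
    return f"https://{host}/jobs?{urlencode(params)}"


def viewjob_url(jk: str, host: str = "www.indeed.com") -> str:
    """Build the detail-page URL for a job key on the given TLD."""
    return f"https://{host}/viewjob?{urlencode({'jk': jk})}"
```

For example, search_url("machine learning engineer", "San Francisco, CA") reproduces the "Search page 1" URL above, because urlencode encodes spaces as + and the comma as %2C.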
3. Indeed DOM structure and CSS selectors
Indeed's frontend is a React application that server-side renders the initial HTML payload. This means the core job data is present in the raw HTML response — you do not need JavaScript execution for viewjob pages under normal conditions. However, Indeed periodically rotates CSS class names and data-testid values, so build your selectors to be resilient: prefer data-testid attributes over generated class names where both are available, and maintain a fallback selector chain.
On viewjob pages, the job title is in an h1 element carrying either the class jobsearch-JobInfoHeader-title or the attribute data-testid="jobTitle" — both have been observed in production. Company name appears in a div or span with data-testid="inlineHeader-companyName". Location is in div[data-testid="job-location"]. The salary block, when present, is in div#salaryInfoAndJobType, which contains both the salary range span and the job type span as siblings.
The full job description is inside div#jobDescriptionText, which contains a mix of paragraphs, unordered lists, and occasionally tables. This element's id has been stable for several years. On search result cards, each card carries a data-jk attribute on its outermost anchor or li element — this is your primary mechanism for harvesting job keys from search pages without parsing the viewjob URL.
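As a rough illustration of harvesting job keys from search HTML, the sketch below collects every data-jk attribute value using only the Python standard library. It assumes you already hold the raw search page HTML as a string; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class JobKeyCollector(HTMLParser):
    """Collect unique data-jk attribute values in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.job_keys: List[str] = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "data-jk" and value and value not in self._seen:
                self._seen.add(value)
                self.job_keys.append(value)


def collect_job_keys(html: str) -> List[str]:
    collector = JobKeyCollector()
    collector.feed(html)
    collector.close()
    return collector.job_keys
```

The collector ignores which element carries the attribute, so it keeps working whether the key sits on the anchor or on the li. Sponsored and organic cards both carry data-jk, so the same key can appear twice on one page, which is why duplicates are dropped.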
4. Cloudflare bot management and geo-specific catalogs
Indeed runs Cloudflare Bot Management (the paid bot-detection product, not the free Turnstile challenge widget) on its search endpoints and on high-volume viewjob access patterns. Cloudflare Bot Management combines TLS fingerprinting, HTTP/2 frame ordering analysis, IP reputation, and behavioural heuristics, so requests from datacenter IP ranges are typically blocked regardless of header spoofing. Residential proxies with a matching country egress are the baseline requirement.
Search pagination is the highest-friction surface. Cloudflare challenge frequency increases sharply after the first three to four pages of results for a given query. The practical mitigation is to slow your search crawl significantly — 8 to 15 seconds between requests is a reasonable floor — and to pivot to viewjob scraping as soon as you have a batch of jk values. viewjob pages at moderate rate (one every 3–5 seconds) are substantially less likely to trigger challenges than search pages.
Salary data availability also varies by geography. US states with pay transparency laws (Colorado, New York, California, Washington) produce a higher density of salary-disclosed listings. Non-US TLDs follow their own disclosure norms. Configure your proxy egress to match the TLD you are targeting to avoid geo-mismatch responses.
- Cloudflare Bot Management on search and high-volume viewjob access
- TLS fingerprint checks that flag non-browser clients at the handshake, alongside blocking of datacenter IP ranges
- Geo-specific salary disclosure rules per country TLD
- Pagination challenge escalation past page 3–4 on search
- Sponsored listings interleaved with organic results, both carrying data-jk
- Periodic data-testid and CSS class name rotation (monthly to quarterly cadence)
- Rate-based blocking independent of Cloudflare — Indeed's own WAF layer
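The pacing figures from this section can be kept in one place so that every worker draws its delay from the same ranges. The following is a minimal sketch; DELAY_RANGES and next_delay are illustrative names, and the random jitter is an assumption added here so that requests do not arrive at perfectly regular intervals.

```python
import random

# Seconds between requests per surface, taken from the ranges quoted above.
DELAY_RANGES = {
    "search": (8.0, 15.0),   # highest-friction surface
    "viewjob": (3.0, 5.0),   # moderate rate for detail pages
}


def next_delay(surface: str) -> float:
    """Return a jittered delay in seconds for the given surface."""
    low, high = DELAY_RANGES[surface]
    return random.uniform(low, high)
```

A worker would call time.sleep(next_delay("search")) between search requests and time.sleep(next_delay("viewjob")) between detail requests.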
5. Scrape a single Indeed job posting
viewjob URLs are your lowest-risk scrape target. Use mode auto so OmniScrape tries a fast HTTP request first and escalates to a headless browser only if Cloudflare intervenes. Set proxy to residential:us for US listings (adjust country suffix for other TLDs). The css_extractor output format runs selector evaluation server-side and returns only the extracted fields, keeping response payloads small.
The selectors below cover the current DOM structure. If a selector returns null for a field that you expect to be present, check whether Indeed has rotated the data-testid value — the fallback approach is to switch output_format to html and parse body.data.content with a server-side HTML parser using a selector fallback chain.
{
  "url": "https://www.indeed.com/viewjob?jk=3f8b2a1c9d0e4f5a",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "title": "h1[data-testid=\"jobTitle\"], h1.jobsearch-JobInfoHeader-title",
    "company": "[data-testid=\"inlineHeader-companyName\"]",
    "location": "[data-testid=\"job-location\"]",
    "salary": "#salaryInfoAndJobType",
    "job_type": "#salaryInfoAndJobType span",
    "description": "#jobDescriptionText",
    "posted": "span[data-testid=\"myJobsStateDate\"], span.date",
    "apply_button": "[data-testid=\"applyButton\"]",
    "benefits": "[data-testid=\"benefits-test\"]"
  }
}
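If a selector starts returning null, the fallback chain described above could look something like the sketch below, applied to the HTML you get back with output_format set to html. It uses only the Python standard library and matches on a single attribute rather than full CSS selectors. The names AttrTextExtractor and extract_first are hypothetical, and the parser assumes reasonably well-formed markup with closing tags.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AttrTextExtractor(HTMLParser):
    """Capture the text of the first element whose attribute matches a value."""

    def __init__(self, attr: str, value: str) -> None:
        super().__init__()
        self.attr, self.value = attr, value
        self.depth = 0          # greater than 0 while inside the matched element
        self.done = False
        self.parts: List[str] = []

    def _matches(self, attrs: dict) -> bool:
        got = attrs.get(self.attr) or ""
        if self.attr == "class":
            return self.value in got.split()  # match one class among several
        return got == self.value

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(dict(attrs)):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_first(html: str, chain: List[Tuple[str, str]]) -> Optional[str]:
    """Try each (attribute, value) pair in order and return the first non-empty text."""
    for attr, value in chain:
        parser = AttrTextExtractor(attr, value)
        parser.feed(html)
        parser.close()
        text = " ".join("".join(parser.parts).split())
        if text:
            return text
    return None
```

For the job title, the chain would be [("data-testid", "jobTitle"), ("class", "jobsearch-JobInfoHeader-title")], mirroring the two variants named in section 3. In production a CSS-capable parser is likely a better fit; the point here is only the ordered fallback.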
6. Scrape Indeed search results and paginate
Search scraping serves one primary purpose in a well-designed pipeline: harvesting jk job keys that you then feed into a viewjob scrape queue. Extract the data-jk attribute from each card rather than trying to parse all detail fields from the search card DOM — card data is truncated and inconsistently structured compared to the viewjob page.
Keep &start= pagination slow. A request every 10–15 seconds per query is a reasonable starting rate. If you need to cover many queries in parallel, distribute them across separate session IDs so each session sees a low per-query request rate. Stop pagination when the result set returns fewer cards than the page size (typically 15 on desktop user agents) — this signals you have reached the end of the result set for that query.
{
  "url": "https://www.indeed.com/jobs?q=python+developer&l=Austin%2C+TX&start=0&fromage=7",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "titles": "h2.jobTitle span[title]",
    "companies": "[data-testid=\"company-name\"]",
    "locations": "[data-testid=\"text-location\"]",
    "salaries": ".salary-snippet-container, .salaryOnly",
    "job_keys": "[data-jk]",
    "sponsored_flags": "[data-testid=\"ad-IndeedApply\"]",
    "posting_dates": "span.date"
  }
}
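The pagination rules above (step start by 10, pause 10 to 15 seconds, stop on a short page) can be sketched as a loop. This is an illustration under assumptions: fetch_page is a hypothetical callable that you supply, which requests the search page for a given start offset and returns the list of jk values found on it. The max_pages cap is an added safety limit, not something Indeed defines.

```python
import random
import time
from typing import Callable, List


def harvest_job_keys(fetch_page: Callable[[int], List[str]],
                     page_size: int = 15, step: int = 10, max_pages: int = 4,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Walk &start= offsets slowly and return unique job keys in discovery order."""
    seen = set()
    ordered: List[str] = []
    for page in range(max_pages):
        if page:
            sleep(random.uniform(10, 15))  # slow per-query pacing between pages
        keys = fetch_page(page * step)
        for jk in keys:
            if jk not in seen:             # the same key can recur across pages
                seen.add(jk)
                ordered.append(jk)
        if len(keys) < page_size:          # short page: end of the result set
            break
    return ordered
```

The sleep argument is injectable so the loop can be exercised in tests without waiting. Keeping max_pages low reflects the observation in section 4 that challenge frequency rises after the first three to four pages.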
7. Handling Cloudflare challenges on Indeed
When Indeed returns a Cloudflare interstitial, the response HTML will contain a challenge page rather than job data — body.data.content will include strings like "Just a moment" or "Checking your browser". The correct response is not to retry immediately; that compounds the block. Instead, back off for 30–60 seconds before the next attempt on that session.
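A minimal sketch of that detect-and-back-off behaviour is shown below. It assumes fetch is a hypothetical zero-argument callable that performs one request and returns the body.data.content string; the marker strings are the two quoted above, and the function names are illustrative.

```python
import random
import time
from typing import Callable, Optional

CHALLENGE_MARKERS = ("Just a moment", "Checking your browser")


def is_challenge_page(content: str) -> bool:
    """True when the returned HTML looks like a Cloudflare interstitial."""
    return any(marker in content for marker in CHALLENGE_MARKERS)


def fetch_with_backoff(fetch: Callable[[], str], max_attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Retry a challenged request after a 30-60 second pause; None if still blocked."""
    for attempt in range(max_attempts):
        content = fetch()
        if not is_challenge_page(content):
            return content
        if attempt < max_attempts - 1:
            sleep(random.uniform(30, 60))  # never retry immediately
    return None
```

Returning None rather than raising lets the caller decide whether to park the session or fall back to another source of job keys.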
The most effective configuration for Indeed is mode auto with enable_solver: true and a residential proxy. This tells OmniScrape to attempt a fast HTTP request first, and if Cloudflare responds with a challenge, to escalate to a headless browser session that can execute the Cloudflare JavaScript challenge and return the actual page content. The Web Unlocker capability handles the challenge resolution transparently — you receive body.data.content containing the real job page HTML or extracted CSS fields.
If search endpoints remain blocked after solver escalation, fall back to scraping viewjob URLs directly for the jk values you already have. viewjob pages at moderate rate almost always succeed when search is temporarily blocked. You can also use Indeed's public RSS feeds (https://www.indeed.com/rss?q=...&l=...) as a low-friction seed source for fresh jk values — RSS responses are plain XML and rarely challenged. For a detailed breakdown of Cloudflare bypass techniques, see Cloudflare bypass.
8. Deduplicating Indeed job postings
Deduplication on Indeed is non-trivial because the same underlying role can appear under multiple jk values. Employers repost expired listings, staffing agencies post the same role across multiple accounts, and Indeed's own indexing sometimes creates duplicate entries for a single ATS posting. A jk-only deduplication strategy will miss a meaningful fraction of duplicates.
A robust deduplication approach uses two layers. The first is exact deduplication on jk — store each jk you have processed and skip re-scraping it unless the posting date has changed, which signals an edit. The second is fuzzy deduplication on a composite key of normalised company name + normalised job title + normalised location. Use a string similarity measure (Levenshtein or trigram similarity) with a threshold around 0.85 to catch near-duplicate titles like "Senior Python Developer" vs "Sr. Python Developer".
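The fuzzy layer might be sketched as follows. Note the substitutions: this uses difflib.SequenceMatcher from the Python standard library as the similarity measure instead of Levenshtein or trigram similarity, and it takes one possible reading of the composite key, requiring company and location to match exactly after normalisation while applying the 0.85 threshold to the title only. The function names and record shape are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Dict


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def is_probable_duplicate(a: Dict[str, str], b: Dict[str, str],
                          threshold: float = 0.85) -> bool:
    """Compare two postings given as dicts with company, title and location keys."""
    if normalise(a["company"]) != normalise(b["company"]):
        return False
    if normalise(a["location"]) != normalise(b["location"]):
        return False
    ratio = SequenceMatcher(None, normalise(a["title"]), normalise(b["title"])).ratio()
    return ratio >= threshold
```

Scoring the title alone avoids a pitfall of scoring the whole concatenated key: a long shared company and location can push two genuinely different roles over the threshold. Treat 0.85 as a starting point and tune it against a labelled sample of your own data.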
For description-level change detection, store a SHA-256 hash of the div#jobDescriptionText inner HTML. When a jk you have already processed reappears in search results, re-scrape and compare hashes — a changed hash indicates the employer edited the description, which is a meaningful signal for job board freshness tracking. If you are merging Indeed data with other sources, see job board web scraping for cross-source ID reconciliation patterns.
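The hash comparison can be sketched in a few lines. This assumes the inner HTML of div#jobDescriptionText is already available as a string and uses an in-memory dict as a stand-in for whatever store your pipeline uses; the function names are illustrative.

```python
import hashlib
from typing import Dict


def description_fingerprint(inner_html: str) -> str:
    """SHA-256 hex digest of the description's inner HTML."""
    return hashlib.sha256(inner_html.encode("utf-8")).hexdigest()


def classify_rescrape(store: Dict[str, str], jk: str, inner_html: str) -> str:
    """Return 'new', 'unchanged' or 'edited' for a jk and update the stored hash."""
    fingerprint = description_fingerprint(inner_html)
    previous = store.get(jk)
    store[jk] = fingerprint
    if previous is None:
        return "new"
    return "unchanged" if previous == fingerprint else "edited"
```

Because the hash covers the raw markup, purely cosmetic markup changes also register as edits. If that proves noisy, hashing the whitespace-normalised text content instead is one option.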
9. Indeed Terms of Service and legal considerations
Indeed's Terms of Service prohibit automated access, scraping, and republication of job listings without explicit written permission. Operating a competing job board or job aggregation product using Indeed data carries material legal risk and has been the subject of cease-and-desist actions. If your use case involves redistributing Indeed listings, you need legal review before proceeding.
Internal labour market research at low volume — for example, a compensation benchmarking tool used within a single organisation, or an academic study of hiring trends — presents a different risk profile. The practical and legal exposure is substantially lower when data is not redistributed and volume is modest. That said, this guide does not constitute legal advice; consult counsel for your specific use case.
From a technical compliance standpoint, respect the robots.txt directives, do not hammer the site at rates that degrade service for other users, and do not store personally identifiable information from job postings beyond what your use case requires.
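One way to make the robots.txt check routine is to run every candidate URL through the standard library's urllib.robotparser before queueing it. The sketch below assumes you have already downloaded the robots.txt body and split it into lines; the function name and the default user agent string are hypothetical.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines: Iterable[str], url: str,
                      user_agent: str = "my-research-bot") -> bool:
    """Check a URL against robots.txt rules that have already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

For example, with the made-up rules ["User-agent: *", "Disallow: /private/"], a URL under /private/ is rejected and any other path is allowed. These sample rules are illustrative only and are not taken from Indeed's actual robots.txt.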
Frequently asked questions
How do I paginate Indeed search results?
Increment the &start= parameter by 10 per page (&start=0, &start=10, &start=20, and so on). Indeed typically returns 15 results per page on desktop user agents, but the start increment is always 10, which is a quirk of their pagination implementation. Stop when the number of job cards returned is less than the expected page size, which indicates you have reached the end of the result set. Space requests 8–15 seconds apart on search endpoints to reduce Cloudflare challenge frequency.
Why is the salary field empty on most Indeed job postings?
Salary disclosure on Indeed is voluntary in most US states and most non-US markets. Only a minority of postings include a salary range — estimates vary but 20–35% is a commonly observed range depending on job category and geography. US states with pay transparency laws (Colorado, New York, California, Washington) produce higher disclosure rates. An empty salary field is expected behaviour; do not treat it as a scraping failure. When salary is present, it appears in div#salaryInfoAndJobType as a text string that requires normalisation ("$65 an hour", "$120,000 - $150,000 a year").
Are job keys (jk) the same across indeed.com and indeed.co.uk?
No. Each Indeed country TLD maintains an independent job index with its own jk namespace. A jk from indeed.com is not valid on indeed.co.uk and vice versa. When scraping multiple country catalogs, store jk values with a country prefix or in separate namespaces, and use a proxy egress point matching the target country TLD to avoid geo-mismatch responses.
Does Indeed block OmniScrape specifically?
Indeed's bot management targets traffic patterns and infrastructure characteristics (datacenter IP ranges, TLS fingerprints, and request rate signatures), not specific vendors by name. Using mode auto with enable_solver: true and a residential proxy routes requests through IPs that closely resemble organic browser traffic, which resolves the vast majority of Cloudflare challenges. If you are still seeing challenge pages, verify that your proxy is set to residential:us (or the matching country) and that you are sending no more than one request every 8 seconds on search endpoints.
Should I scrape search pages or viewjob pages?
Use search pages only to discover jk job keys, then scrape viewjob pages for the actual data. Search pages are higher friction (more Cloudflare challenges, truncated data, inconsistent card structure) and lower yield per request. viewjob pages are lower friction, fully structured, and contain the complete job description. The optimal pipeline is: slow search crawl to collect jk values → queue-based viewjob scrape at moderate rate → CSS extraction of structured fields.
How do I detect when a job posting has been edited vs reposted?
Store a SHA-256 hash of the div#jobDescriptionText inner HTML alongside the jk and scrape timestamp. When a jk you have already processed reappears in search results, re-scrape the viewjob page and compare hashes. A changed hash with the same jk indicates an in-place edit. A new jk for what appears to be the same role (detected via fuzzy deduplication on company + title + location) indicates a repost. Track both signals separately — edits and reposts have different implications for job board freshness and employer behaviour analysis.
Can I use Indeed RSS feeds instead of scraping search pages?
Yes, and for many use cases it is the better approach. Indeed exposes RSS feeds at https://www.indeed.com/rss?q=QUERY&l=LOCATION that return the most recent listings for a query as plain XML. RSS responses are rarely Cloudflare-challenged, contain the jk in the link element, and are a low-friction way to maintain a feed of fresh job keys. The limitation is that RSS only surfaces the most recent ~25 listings per query and does not support the full filter set available on the search UI. Use RSS as a seed source and fall back to search scraping when you need historical depth or advanced filters.
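Pulling job keys out of an RSS response might look like the sketch below, which uses only the Python standard library. It assumes each item's link element holds a URL with jk as a query parameter; the exact link format is an assumption, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import parse_qs, urlparse


def job_keys_from_rss(xml_text: str) -> List[str]:
    """Return unique jk values found in the feed's link elements, in order."""
    root = ET.fromstring(xml_text)
    keys: List[str] = []
    for link in root.iter("link"):
        query = urlparse(link.text or "").query
        jk_values = parse_qs(query).get("jk")
        if jk_values and jk_values[0] not in keys:
            keys.append(jk_values[0])
    return keys
```

Links without a jk parameter, such as the channel-level link back to the search page, are skipped. The resulting keys can go straight into the viewjob scrape queue.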