1. Data fields worth extracting from LinkedIn
Before writing selectors, map what is actually available to a logged-out browser versus what requires authentication. LinkedIn aggressively gates high-value fields behind login — email addresses, full work histories, connection graphs, and InMail content are never accessible without an authenticated session, and automating that session violates the User Agreement.
The fields below are those visible on public pages to an unauthenticated request. Even these can disappear behind a login prompt if LinkedIn's bot detection flags your traffic — residential proxies and low request frequency reduce but do not eliminate that risk.
Sales intelligence teams typically want company size, industry, headquarters, and recent post activity. Recruiting and HR products want job title, location, seniority level, and posting date to feed freshness signals. Enrichment pipelines want public profile headlines and current employer to cross-reference against CRM records — not contact details.
- Company: name, tagline, industry vertical, employee count range (e.g. '10,001+ employees'), HQ city and country, follower count, LinkedIn URL slug
- Company about: founded year, company type (public/private/nonprofit), website URL, specialties list, description text
- Company posts: post text, reaction counts, comment counts, post URL, author name — only on public post pages
- Jobs: title, company name, location string, workplace type (Remote / Hybrid / On-site), time since posted, easy-apply flag
- Job description: responsibilities and requirements text, salary range when listed, applicant count when visible (e.g. 'Over 200 applicants')
- Public profiles (logged-out): headline, current role title and employer, location, connection count label ('500+ connections') — full work history is gated
- School and showcase pages: page name, follower count, description — same structure as company pages
2. LinkedIn URL patterns for public pages
Use canonical URLs directly. Avoid constructing URLs from search result fragments — LinkedIn's search pages are heavily rate-limited and often redirect to login for non-human traffic. Company slugs and job IDs are stable identifiers you can store and revisit.
Job search URLs (/jobs/search/) work without login for the first few pages but paginate via &start=25 increments and block aggressive crawlers quickly. Treat them as a discovery mechanism, not a bulk harvest endpoint. Individual job view URLs are far more reliable for extraction.
People search (/search/results/people/) and Sales Navigator are fully login-gated. Do not attempt to automate them.
- Company overview: https://www.linkedin.com/company/stripe/
- Company about tab: https://www.linkedin.com/company/stripe/about/
- Company posts tab: https://www.linkedin.com/company/stripe/posts/
- Job view (canonical): https://www.linkedin.com/jobs/view/3847291847/
- Job search (logged-out, limited depth): https://www.linkedin.com/jobs/search/?keywords=data+engineer&location=Berlin
- Public profile: https://www.linkedin.com/in/williamhgates/ — only headline and partial experience visible without login
- School page: https://www.linkedin.com/school/mit/
- Showcase page: https://www.linkedin.com/showcase/microsoft-azure/
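The URL patterns above can be turned into stable identifiers with a small classifier. The following is a minimal sketch, not a definitive implementation: the function name `classify_linkedin_url` and the pattern table are illustrative assumptions, and only the canonical path shapes listed above are handled.

```python
import re
from urllib.parse import urlparse

# Path patterns for the public page types listed above (illustrative, not exhaustive).
_PATTERNS = {
    "company": re.compile(r"^/company/([^/]+)(?:/(?:about|posts))?/?$"),
    "job": re.compile(r"^/jobs/view/(\d+)/?$"),
    "profile": re.compile(r"^/in/([^/]+)/?$"),
    "school": re.compile(r"^/school/([^/]+)/?$"),
    "showcase": re.compile(r"^/showcase/([^/]+)/?$"),
}


def classify_linkedin_url(url):
    """Return (page_type, identifier) for a canonical public URL, or None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("www.linkedin.com", "linkedin.com"):
        return None
    for page_type, pattern in _PATTERNS.items():
        match = pattern.match(parsed.path)
        if match:
            # The identifier is the company/school slug or the numeric job ID.
            return page_type, match.group(1)
    return None  # search pages and anything unrecognized fall through
```

Storing the returned slug or job ID, rather than the full URL, keeps records stable when tracking parameters or tab suffixes differ between fetches.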
3. Parsing LinkedIn's obfuscated markup
LinkedIn randomizes CSS class names on every deployment using CSS module hashing. A selector that works today — org-top-card-summary__title, for example — may be replaced with a meaningless hash within weeks. This is the primary reason LinkedIn scrapers require constant maintenance. Build monitoring into any production pipeline: alert on empty extracts rather than silently storing null values.
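One way to alert on empty extracts is to check each batch for blank required fields before storing it. This is a rough sketch under stated assumptions: the helper names `empty_fields` and `should_alert` and the 20% threshold are hypothetical and should be tuned to your own pipeline.

```python
def empty_fields(record, required):
    """Return the required field names whose extracted value is missing or blank."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def should_alert(records, required, threshold=0.2):
    """True when the share of records missing any required field exceeds threshold."""
    if not records:
        return True  # an empty batch is itself a signal worth alerting on
    broken = sum(1 for record in records if empty_fields(record, required))
    return broken / len(records) > threshold
```

A sudden jump in the share of broken records for one field usually points to a changed selector rather than to missing data on the page.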
Stable anchors that survive deploys more reliably include: JSON-LD structured data embedded in script tags, data-test-id attributes on some page elements, semantic HTML structure (h1 as the first heading in the top card, the first anchor inside a company span), and ARIA labels on interactive elements. JSON-LD is the most reliable: LinkedIn embeds JobPosting schema on job view pages and Organization schema on some company pages.
For job postings, parse the JSON-LD block first. It typically contains title, datePosted, hiringOrganization.name, jobLocation, and description. Fall back to CSS selectors only for fields not in structured data — applicant count and workplace type are rarely in JSON-LD.
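The JSON-LD-first approach can be sketched with the Python standard library alone. The function name `extract_job_posting` is an illustrative assumption, and the sketch assumes the block is a plain object or a list of objects with `@type` set to `JobPosting`; nested `@graph` layouts are not handled.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_job_posting(html):
    """Return the first JSON-LD object whose @type is JobPosting, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: skip it rather than fail the whole page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "JobPosting":
                return item
    return None
```

Fields read from the returned dictionary (for example `title` or `datePosted`) should still be accessed with `.get()`, since not every posting carries every property.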
Company employee counts appear as text near the top card — look for a span or anchor containing the string 'employees' and extract the preceding number range. The exact wrapper element changes; a text-content search is more durable than a class-name selector.
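A text-content search for the employee count might look like the sketch below. It assumes the label has already been pulled out of the page as plain text; the function name `parse_employee_range` and the exact label shapes it accepts are assumptions.

```python
import re

# Matches labels such as "10,001+ employees", "51-200 employees", "2 employees".
_EMPLOYEES = re.compile(
    r"(\d[\d,]*)\s*(?:(\+)|[-\u2013]\s*(\d[\d,]*))?\s+employees",
    re.IGNORECASE,
)


def parse_employee_range(text):
    """Return (low, high) from an employee-count label; high is None when open-ended."""
    match = _EMPLOYEES.search(text)
    if not match:
        return None
    low = int(match.group(1).replace(",", ""))
    if match.group(2):  # "10,001+" form: no upper bound
        return low, None
    if match.group(3):  # "51-200" form
        return low, int(match.group(3).replace(",", ""))
    return low, low  # a single number
```

Because the pattern keys on the word 'employees' rather than on a class name, it keeps working when the wrapper element changes.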
Job descriptions often render truncated with a 'Show more' button. The full text is inside a container that requires JavaScript interaction to expand — this is why js_rendering mode is necessary for complete description extraction.
4. LinkedIn's anti-bot enforcement stack
LinkedIn operates one of the more sophisticated bot-detection systems among public websites. It combines IP reputation scoring (datacenter ranges are blocked almost immediately), behavioral fingerprinting, TLS fingerprint analysis, rate limiting at the IP and session level, and authentication walls that appear selectively based on traffic patterns. A request that returns full HTML on the first hit may return a login redirect on the tenth from the same IP.
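The legal layer compounds the technical one. In hiQ Labs v. LinkedIn, the Ninth Circuit (2019, reaffirmed in 2022 after a Supreme Court remand) held that scraping publicly accessible data likely does not violate the CFAA, but that ruling does not override LinkedIn's User Agreement, which explicitly prohibits scraping, crawling, and bot use. The district court later found that hiQ had breached the User Agreement, and the case ended in a consent judgment against hiQ. LinkedIn, a Microsoft subsidiary, has sought and obtained injunctions against scrapers. Commercial products reselling LinkedIn-derived datasets face the highest exposure.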
For any access pattern you do pursue on public pages: use residential proxies, keep request rates well below human browsing speed, do not attempt to log in programmatically, and do not scrape fields that are only visible after authentication. The OmniScrape Web Unlocker returns what a logged-out browser sees — it does not bypass authentication.
- Immediate blocks on datacenter IP ranges — residential proxies are not optional
- Login redirect walls on most profile fields beyond headline and current role
- Rate limits and CAPTCHA challenges on job search pagination beyond page 3–4
- CSS class name randomization breaking static selectors within weeks
- JavaScript-rendered content requiring headless browser execution for full text
- TLS and browser fingerprint checks that flag non-browser HTTP clients
- Contractual prohibition in LinkedIn User Agreement, actively enforced by Microsoft
- Legal precedent: LinkedIn has obtained injunctions against commercial scrapers
5. Scrape a public company page
Company overview and about pages are the lowest-risk LinkedIn targets — they are public marketing content LinkedIn explicitly wants indexed by search engines. Use mode 'auto' with a residential proxy. The css_extractor output format lets OmniScrape run selector matching server-side and return only the fields you need.
The selectors below target the logged-out company page structure as of the current guide revision. Treat them as starting points — validate against live HTML and update when extracts go empty. The about section selector targets the description paragraph inside the about module; website and employee count are in the info list near the top card.
If the page returns a login redirect instead of company content, LinkedIn has flagged the request. Reduce frequency, rotate residential proxy endpoints, and avoid hitting the same company URL more than once per session.
{
  "url": "https://www.linkedin.com/company/stripe/about/",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "name": "h1",
    "tagline": "p.org-top-card-summary__tagline",
    "industry": "div[data-test-id='about-us__industry'] dd",
    "employee_count": "div[data-test-id='about-us__size'] dd",
    "headquarters": "div[data-test-id='about-us__headquarters'] dd",
    "founded": "div[data-test-id='about-us__foundedOn'] dd",
    "company_type": "div[data-test-id='about-us__organizationType'] dd",
    "website": "div[data-test-id='about-us__website'] a",
    "description": "p[data-test-id='about-us__description']",
    "followers": "span[data-test-id='org-top-card-followers-count']"
  }
}
6. Scrape an individual job posting
Individual job view URLs (/jobs/view/JOB_ID/) are the most reliable LinkedIn extraction target. The page structure is more consistent than search results, and LinkedIn embeds JobPosting JSON-LD structured data that survives class name randomization. Use js_rendering mode because the full job description requires JavaScript to expand: the 'Show more' control has to run before the complete markup is present in the DOM.
The js_wait_selector targets the job title heading, which appears once the top card has rendered. Set js_wait_timeout to at least 10 seconds — LinkedIn's JS bundle is large and cold-start render times on residential proxies can be slow.
After extraction, always attempt to parse the JSON-LD block from the raw HTML as a secondary data source. The css_extractor fields below cover what is not in structured data: applicant count, workplace type badge, and the expanded description container.
Applicant count (e.g. 'Over 200 applicants') and salary range appear inconsistently — they are present on some postings and absent on others. Handle missing fields gracefully in your parser.
{
  "url": "https://www.linkedin.com/jobs/view/3847291847/",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "js_wait_selector": "h1.top-card-layout__title",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "title": "h1.top-card-layout__title",
    "company": "a.topcard__org-name-link",
    "location": "span.topcard__flavor--bullet",
    "workplace_type": "span.workplace-type",
    "posted": "span.posted-time-ago__text",
    "applicants": "span.num-applicants__caption",
    "salary": "div.salary.compensation__salary",
    "description": "div.show-more-less-html__markup",
    "seniority_level": "li:nth-of-type(1) > span.description__job-criteria-text",
    "employment_type": "li:nth-of-type(2) > span.description__job-criteria-text"
  }
}
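Handling the inconsistent fields gracefully could look like the following sketch, which post-processes an extractor result. It assumes the result arrives as a flat dictionary of strings keyed by the selector names above; the helper names `parse_applicants` and `clean_job_fields` and the `applicants_number` key are hypothetical.

```python
import re


def parse_applicants(label):
    """Return the integer quoted in an applicant-count label, or None if absent."""
    if not label:
        return None
    match = re.search(r"(\d[\d,]*)", label)
    return int(match.group(1).replace(",", "")) if match else None


def clean_job_fields(raw):
    """Normalize an extractor result so blank optional fields become None."""
    cleaned = {
        key: (value.strip() or None) if isinstance(value, str) else value
        for key, value in raw.items()
    }
    # For a label like 'Over 200 applicants' the number is a lower bound, not an exact count.
    cleaned["applicants_number"] = parse_applicants(cleaned.get("applicants"))
    cleaned.setdefault("applicants", None)
    cleaned.setdefault("salary", None)
    return cleaned
```

Storing None rather than an empty string makes it easy to tell a posting without a salary apart from a selector that stopped matching across the whole batch.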
7. Job board aggregation at scale
For systematic job harvesting across multiple sources, see the dedicated job board web scraping guide. LinkedIn-specific considerations at scale are more restrictive than most job boards.
Discovery: use the job search URL with &keywords= and &location= parameters to find job IDs, then scrape individual /jobs/view/JOB_ID/ pages. Do not rely on search result HTML for description text — it is always truncated. Extract the job ID from the URL (the numeric segment) and use it as your deduplication key. Title-based deduplication fails because the same role is often posted multiple times by the same employer.
Pagination: job search paginates via &start=25, &start=50, etc. LinkedIn blocks crawlers that paginate aggressively. Limit to the first 3–4 pages per query, rotate queries rather than paginating deep, and introduce delays between requests that reflect human browsing cadence — several seconds minimum, not milliseconds.
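The shallow-pagination rule can be sketched as a URL builder with a hard page cap and a randomized pause. The function names, the four-page cap, and the 3 to 8 second delay window are assumptions chosen to match the guidance above, not published limits.

```python
import random
from urllib.parse import urlencode

SEARCH_BASE = "https://www.linkedin.com/jobs/search/"
PAGE_SIZE = 25  # step used by the &start= parameter
MAX_PAGES = 4   # stay within the first 3-4 pages per query


def search_page_urls(keywords, location, pages=MAX_PAGES):
    """Build logged-out job search URLs for the first few result pages only."""
    pages = min(pages, MAX_PAGES)
    urls = []
    for page in range(pages):
        params = {"keywords": keywords, "location": location}
        if page:  # the first page needs no start offset
            params["start"] = page * PAGE_SIZE
        urls.append(f"{SEARCH_BASE}?{urlencode(params)}")
    return urls


def delay_seconds(low=3.0, high=8.0):
    """Pick a randomized pause between requests (window is an assumption)."""
    return random.uniform(low, high)
```

Call `time.sleep(delay_seconds())` between fetches, and prefer adding a new keyword or location query over raising the page cap.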
Freshness: employers edit job descriptions in place without changing the job ID or URL. Archive the full description text with a scraped_at timestamp on every fetch. Compare description hashes across runs to detect silent edits. The datePosted field in JSON-LD reflects original posting date, not last edit.
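Detecting silent edits by comparing description hashes can be sketched as follows. The row layout and the function names `description_hash`, `snapshot`, and `was_edited` are illustrative assumptions; whitespace is normalized before hashing so that layout-only changes do not register as edits.

```python
import hashlib
from datetime import datetime, timezone


def description_hash(text):
    """Hash whitespace-normalized description text so layout noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def snapshot(job_id, description, scraped_at=None):
    """Build an archive row for one fetch of a job posting."""
    return {
        "job_id": job_id,
        "scraped_at": (scraped_at or datetime.now(timezone.utc)).isoformat(),
        "description": description,
        "description_hash": description_hash(description),
    }


def was_edited(previous, current):
    """True when two snapshots of the same job ID carry different description text."""
    return (
        previous["job_id"] == current["job_id"]
        and previous["description_hash"] != current["description_hash"]
    )
```

Keeping every snapshot, not only the latest one, lets you reconstruct when a requirement or salary line changed.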
Deduplication across sources: if you aggregate from multiple job boards, the same posting often appears on LinkedIn, Indeed, and the company's own careers page. Normalize on (company_name, job_title, location, posted_date) as a composite key, not on URL.
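A composite key for cross-source deduplication might be built as in this sketch. The normalization rules (lowercasing, stripping accents and punctuation, truncating the date to the day) are assumptions; real pipelines often also need company-alias handling, which is omitted here.

```python
import re
import unicodedata


def _norm(value):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", value or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def composite_key(company_name, job_title, location, posted_date):
    """Build the (company, title, location, date) key; posted_date is an ISO 8601 string."""
    return (
        _norm(company_name),
        _norm(job_title),
        _norm(location),
        posted_date[:10],  # keep only YYYY-MM-DD
    )
```

Two rows from different boards that produce the same tuple can then be merged, with the LinkedIn job ID kept as a per-source attribute.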
Volume limits: there is no published rate limit from LinkedIn. Treat any 429 or redirect-to-login response as a signal to back off for that IP endpoint. Residential proxy rotation across a large pool is the primary mitigation — do not retry immediately on block.
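The back-off rule can be sketched as a block detector plus a jittered exponential delay per proxy endpoint. The login path prefixes, the 60-second base, and the one-hour cap are assumptions to verify against the responses you actually observe.

```python
import random
from urllib.parse import urlparse

# Path prefixes treated as login walls. These are assumptions; check live redirects.
LOGIN_PATH_PREFIXES = ("/login", "/authwall", "/uas/login", "/checkpoint")


def is_blocked(status_code, final_url):
    """True for HTTP 429 or when the final URL after redirects is a login page."""
    if status_code == 429:
        return True
    return urlparse(final_url).path.startswith(LOGIN_PATH_PREFIXES)


def backoff_seconds(consecutive_blocks, base=60.0, cap=3600.0, rng=random):
    """Exponential backoff with jitter for one proxy endpoint.

    The delay is drawn from the upper half of the window so a blocked
    endpoint is never retried immediately.
    """
    if consecutive_blocks <= 0:
        return 0.0
    ceiling = min(cap, base * 2 ** (consecutive_blocks - 1))
    return rng.uniform(ceiling / 2, ceiling)
```

Track `consecutive_blocks` per endpoint and reset it to zero after a successful fetch, so one flagged IP does not slow down the rest of the pool.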
8. Official APIs and licensed data alternatives
LinkedIn's official data access programs exist for specific use cases and require approval. The LinkedIn Marketing API covers ad targeting and company page analytics for authorized partners. Talent Solutions and Recruiter contracts provide job posting data and applicant tracking integrations. Sales Navigator has a limited API for authorized CRM vendors. None of these are self-serve for arbitrary data extraction, but they provide legally clean access to the fields you actually need.
For company firmographics without scraping LinkedIn directly, Crunchbase and PitchBook offer structured company graphs with funding, headcount, and industry data via licensed API. People Data Labs aggregates professional identity data (current employer, title, location) from public sources and provides it via API with clear data licensing terms. Clearbit (now part of HubSpot) offers company enrichment on domain lookup. These are the right starting point for B2B enrichment pipelines that need to scale.
For job market intelligence, the Bureau of Labor Statistics JOLTS data, Indeed Hiring Lab, and Lightcast (formerly EMSI Burning Glass) provide aggregated labor market signals without the legal exposure of scraping LinkedIn job postings.
If you have a legitimate need to automate LinkedIn flows for accounts you own and control — for example, managing your own company page programmatically — a Browser-as-a-Service approach with your authenticated session is the technical path. It is still subject to LinkedIn's User Agreement limits on automation, so review those terms with counsel before proceeding.
9. User Agreement and legal risk: read before writing code
LinkedIn's User Agreement (Section 8.2) explicitly prohibits scraping, crawling, spidering, and using bots or other automated means to access the service without LinkedIn's express written permission. This applies regardless of whether the data is publicly visible. LinkedIn, a Microsoft subsidiary, enforces this: it has obtained temporary restraining orders and injunctions against commercial scrapers, and has pursued damages claims.
The hiQ Labs v. LinkedIn litigation (Ninth Circuit opinions in 2019 and 2022) established that scraping publicly accessible data may not constitute unauthorized access under the Computer Fraud and Abuse Act — a narrow technical holding about one federal statute. It does not override the User Agreement as a contract. It does not apply in jurisdictions outside the US. It does not protect resellers of LinkedIn-derived datasets. It does not address state computer crime laws. Do not treat hiQ as a green light.
The practical risk gradient: scraping individual public company pages for internal research at low volume is lower risk than bulk job harvesting, which is lower risk than profile scraping, which is lower risk than automating logged-in sessions, which is lower risk than reselling LinkedIn-derived data commercially. Every step up that gradient increases legal exposure.
OmniScrape provides technical infrastructure for web data extraction. Whether to use that infrastructure on LinkedIn, and for what purpose, is a business and legal decision your organization must make explicitly — ideally with counsel who has reviewed LinkedIn's current User Agreement and the applicable law in your jurisdiction. This guide is a technical reference, not legal advice, and not an endorsement of scraping LinkedIn.
Frequently asked questions
Can I scrape LinkedIn profiles without logging in?
A logged-out browser sees only the profile headline, current role title and employer, general location, and a truncated connection count label. Full work history, education details, skills, recommendations, contact information, and connection graphs are all gated behind authentication. LinkedIn's User Agreement prohibits automating logged-in sessions, so in practice the logged-out view is the only compliant target — and it contains limited data for most enrichment use cases.
Why do my LinkedIn CSS selectors break every few weeks?
LinkedIn uses CSS module hashing in its frontend build pipeline, which generates randomized class names on each deployment. A class like 'org-top-card-summary__title' may be replaced with an opaque hash string within weeks. Build your selectors around stable anchors that survive deploys: JSON-LD structured data in script tags (most reliable), data-test-id attributes, ARIA labels, semantic HTML structure (h1 as the first heading in the top card), and text-content matching for labels like 'employees'. Monitor extracts in production and alert on empty results rather than silently storing null values.
Does OmniScrape bypass LinkedIn's login walls?
No. The OmniScrape Web Unlocker returns what a logged-out browser sees after resolving bot challenges on public pages. It does not authenticate with LinkedIn or access content that requires a user session. If LinkedIn returns a login redirect for a given URL, that content is not accessible without authentication and OmniScrape cannot retrieve it. Bypassing authentication to access gated content is outside compliant use of the API.
What is the lowest-risk LinkedIn data to collect?
Individual job postings on /jobs/view/ URLs and company about pages are the lowest-risk targets — they are public marketing content LinkedIn explicitly indexes in search engines. Collect at low volume, with residential proxies, with delays between requests, and with legal review of your specific use case. The highest-risk activities are: scraping profile data at scale, automating logged-in sessions, and reselling LinkedIn-derived datasets commercially. For any production use case, evaluate official LinkedIn APIs and licensed data providers first.
How do I extract the full job description when it is truncated?
LinkedIn renders job descriptions with a 'Show more' button that requires JavaScript execution to expand the full text. Use mode 'js_rendering' with a js_wait_selector targeting the job title heading (h1.top-card-layout__title) and a js_wait_timeout of at least 10–12 seconds. The full description text appears in div.show-more-less-html__markup after the expand interaction completes. Additionally, parse the JSON-LD JobPosting block in the page's script tags, which sometimes contains the complete description text without requiring JS interaction.
How should I deduplicate LinkedIn job postings?
Use the numeric job ID extracted from the /jobs/view/JOB_ID/ URL as your primary deduplication key — not the job title or description text. The same role is frequently posted multiple times by the same employer with different job IDs, and employers edit descriptions in place without changing the ID. Store a scraped_at timestamp and a hash of the description text on every fetch so you can detect silent edits. For cross-source deduplication (LinkedIn plus Indeed plus company careers page), normalize on a composite key of (company_name, job_title, location, posted_date).
What does the hiQ v. LinkedIn ruling actually mean for scraping?
The Ninth Circuit held that LinkedIn could not invoke the Computer Fraud and Abuse Act to block hiQ from scraping publicly accessible profile pages, because accessing public data does not constitute 'unauthorized access' under that specific statute. This is a narrow holding about one US federal law. It does not override LinkedIn's User Agreement as a contract. It does not apply outside the US. It does not protect commercial resale of LinkedIn data. It does not address state computer crime statutes. Courts in subsequent cases have distinguished hiQ on various grounds. Treat it as a data point for your legal team's analysis, not as a general permission to scrape LinkedIn.