1. Crunchbase fields deal teams and sales teams extract
VC sourcing teams prioritize funding announcements: round type, date, amount, and lead investors. Sales enrichment pipelines care about employee range, category tags, HQ location, and the company website for domain matching. Journalists and analysts want acquisition events and IPO history. Understanding which fields are freely visible versus paywalled determines what your scraper can realistically return.
Fields marked as paywalled below will render as blurred or empty DOM nodes on free views — an empty CSS extraction result for those selectors does not mean your selector is wrong. It means Crunchbase is intentionally hiding the value.
- Organization permalink slug and UUID (embedded in page source JSON)
- Company name, short description, and logo URL
- Founded date, operating status, and closed date if applicable
- Headquarters city, region, and country
- Company website URL
- Category and industry tags
- Employee count range (e.g., 1001–5000)
- Total funding amount — paywalled on most free views
- Number of funding rounds and last funding date
- Last funding type and lead investor names — partially paywalled
- Acquisition targets and acquirer (when public)
- IPO date and stock exchange (when applicable)
- Founder and key people profile links
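One way to carry the field list above through a pipeline is a small record type that keeps paywalled fields optional. This is a minimal sketch: the class name, field names, and the `missing_paywalled` helper are illustrative assumptions, not a Crunchbase schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrganizationRecord:
    """One scraped organization. Field names are illustrative, not a Crunchbase schema."""
    permalink: str                        # slug from the URL, e.g. "stripe"
    name: str
    uuid: Optional[str] = None            # only if found in the embedded page JSON
    description: Optional[str] = None
    website: Optional[str] = None
    hq_location: Optional[str] = None
    employee_range: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    total_funding: Optional[str] = None   # paywalled on most free views
    lead_investors: list[str] = field(default_factory=list)  # partially paywalled


# Fields the list above marks as fully or partially paywalled.
PAYWALLED_FIELDS = {"total_funding", "lead_investors"}


def missing_paywalled(record: OrganizationRecord) -> list[str]:
    """Return paywalled fields that came back empty, so they are not mistaken for selector bugs."""
    return sorted(name for name in PAYWALLED_FIELDS if not getattr(record, name))
```

Logging the result of `missing_paywalled` separately from other empty fields keeps expected paywall gaps out of your selector-failure alerts.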
2. Crunchbase URL patterns and permalink stability
Crunchbase uses human-readable slugs for organization and person permalinks. These slugs are stable enough to use as primary keys in most pipelines — companies rarely change their Crunchbase slug even after rebranding, though it does happen. The UUID embedded in the page source JSON is more durable if you can extract it.
Funding round URLs include a hash suffix that acts as a stable identifier for that specific round event. Bookmark these when you want to track a specific raise over time rather than re-scraping the parent organization page.
- Organization: https://www.crunchbase.com/organization/stripe
- Person: https://www.crunchbase.com/person/patrick-collison
- Funding round: https://www.crunchbase.com/funding_round/stripe-series-h--abc123
- Acquisition: https://www.crunchbase.com/acquisition/company-acquires-target
- Discover search (login-gated): https://www.crunchbase.com/discover/organization.companies
- Category hub: https://www.crunchbase.com/hub/fintech-companies
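The URL patterns above can be split into an entity type and a slug with the standard library. The sketch below assumes the path layout shown in the examples. `parse_crunchbase_url` and `CrunchbaseRef` are hypothetical names, and the function does not validate the hostname.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# Entity path segments taken from the URL examples above.
ENTITY_TYPES = {"organization", "person", "funding_round", "acquisition", "hub"}


class CrunchbaseRef(NamedTuple):
    entity_type: str
    slug: str


def parse_crunchbase_url(url: str) -> Optional[CrunchbaseRef]:
    """Split a Crunchbase URL or site-relative path into (entity_type, slug).

    Returns None for paths that do not match a known entity pattern,
    such as the login-gated discover search.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] in ENTITY_TYPES:
        return CrunchbaseRef(parts[0], parts[1])
    return None
```

For example, `parse_crunchbase_url("https://www.crunchbase.com/organization/stripe")` yields the entity type `organization` and the slug `stripe`, which can be stored directly as a key.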
3. Organization page DOM structure
Crunchbase organization pages are Angular single-page applications. The server renders an initial HTML shell with some metadata, but most content is hydrated client-side. This means fast HTTP-only requests will capture the shell and any JSON-LD embedded in the document, but funding sections and people cards require JavaScript execution to appear in the DOM.
Key selectors on a fully rendered organization page: company name in `h1.profile-name`; short description in `span.description`; headquarters in `span.field-type-address`; website in `a.component--field-formatter.field-type-link`; employee range in `a.field-type-enum`; category chips in `span.chip`; founded date in `span.field-type-date`. Funding round rows render inside `section#funding-rounds` as table rows once the Angular component loads.
Paywalled fields are wrapped in elements with class `cb-paywall`. These nodes exist in the DOM but their text content is replaced with a blur overlay and a prompt to upgrade. Your CSS extractor will return empty strings for those selectors — not an error, just a paywall signal. Crunchbase also embeds JSON-LD `Organization` schema on some pages with `name` and `url` properties, but funding detail is almost never included in the structured data.
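Because JSON-LD sits in the initial document, it can be read from an HTTP-only response without rendering. The following is a minimal sketch using only the Python standard library. It assumes the page carries a `<script type="application/ld+json">` element with an `Organization` object, which the text above says is true only on some pages. The class and function names are illustrative.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_organization_jsonld(html: str):
    """Return the first JSON-LD object whose @type is Organization, or None."""
    parser = JsonLdParser()
    parser.feed(html)
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the scrape
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Organization":
                return item
    return None
```

A `None` result only means the page had no usable structured data. It says nothing about whether the CSS selectors will work after rendering.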
4. Paywalls, anti-bot detection, and rate limits
Crunchbase runs layered defenses. At the network layer, datacenter IP ranges are rate-limited aggressively on organization page requests — you will see 429s or silent redirects to the homepage within a small number of sequential requests from a single datacenter IP. Residential proxies reduce this friction significantly.
At the application layer, the Pro paywall blurs funding amounts and investor lists for unauthenticated or free-tier sessions. This is not bot detection — it is deliberate content gating. Attempting to bypass it by injecting session cookies from a paid account violates Crunchbase Terms and potentially computer fraud statutes depending on jurisdiction.
The discover search and export features require an active login session and are heavily rate-limited even for Pro users. Do not attempt to automate discover exports — scrape known organization permalinks from a seed list instead.
- `cb-paywall` class overlays on funding amounts and investor details
- Login required for discover search and CSV exports
- Aggressive rate limits on organization pages from datacenter IPs
- Angular SPA hydration required for most content sections
- Frequent component class name changes breaking CSS selectors
- Legal terms explicitly prohibiting scraping and automated collection
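The network-layer signals described in this section (429 responses and silent redirects away from the requested page) can drive a simple back-off policy. This is a sketch under stated assumptions: the function names are hypothetical, and the 2-second base and 60-second cap are placeholder values, not limits published by Crunchbase.

```python
import random


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Exponential growth from `base`, capped at `cap`, plus up to `jitter`
    seconds of random spread so parallel workers do not retry in lockstep.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, jitter)


def should_back_off(status_code: int, final_url: str, requested_url: str) -> bool:
    """Treat a 429, or a redirect away from the requested page, as a rate-limit signal."""
    return status_code == 429 or final_url.rstrip("/") != requested_url.rstrip("/")
```

Checking the final URL matters because a redirect to the homepage returns a 200 status and would otherwise look like a successful scrape with empty fields.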
5. Scraping public organization fields with OmniScrape
Use `mode: "auto"` with a residential US proxy for organization pages. OmniScrape will attempt a fast HTTP request first and escalate to a headless browser if the page requires JavaScript rendering. For basic firmographic fields — name, description, location, categories, employee range — the initial HTTP response often contains enough rendered HTML to extract values without full JS execution.
Target only the fields that are freely visible on public pages. If a selector returns an empty string, check whether the field is behind a `cb-paywall` overlay before debugging your selector. The `enable_solver` flag activates OmniScrape's Web Unlocker to handle bot challenges that may appear on high-volume scraping sessions.
{
  "url": "https://www.crunchbase.com/organization/openai",
  "mode": "auto",
  "output_format": "css_extractor",
  "enable_solver": true,
  "proxy": "residential:us",
  "css_selectors": {
    "name": "h1.profile-name",
    "description": "span.description",
    "location": "span.field-type-address",
    "website": "a.component--field-formatter.field-type-link",
    "employees": "a.field-type-enum",
    "categories": "span.chip",
    "founded": "span.field-type-date",
    "operating_status": "span.field-type-enum[href*='operating_status']"
  }
}
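To run the same request over a seed list of permalinks, the payload can be generated per organization and the extracted result checked for empty fields. The sketch below only builds the request body shown above. It does not call OmniScrape, since the endpoint and response envelope are not described here. The assumption that the extractor returns a dict keyed by selector name, and the helper names, are illustrative.

```python
import json

# Selectors copied from the request above.
ORG_SELECTORS = {
    "name": "h1.profile-name",
    "description": "span.description",
    "location": "span.field-type-address",
    "website": "a.component--field-formatter.field-type-link",
    "employees": "a.field-type-enum",
    "categories": "span.chip",
    "founded": "span.field-type-date",
    "operating_status": "span.field-type-enum[href*='operating_status']",
}


def build_org_request(permalink: str) -> dict:
    """Build the request body for one organization permalink."""
    return {
        "url": f"https://www.crunchbase.com/organization/{permalink}",
        "mode": "auto",
        "output_format": "css_extractor",
        "enable_solver": True,
        "proxy": "residential:us",
        "css_selectors": dict(ORG_SELECTORS),
    }


def empty_fields(extracted: dict) -> list[str]:
    """Names of selectors that returned nothing (empty string, empty list, or missing key).

    Assumes the extractor result is a dict keyed by selector name.
    """
    return sorted(key for key in ORG_SELECTORS if not extracted.get(key))


if __name__ == "__main__":
    print(json.dumps(build_org_request("openai"), indent=2))
```

If `empty_fields` reports every field as empty, the page probably needs JavaScript rendering or the request was rate-limited. If only a few fields are empty, check for a `cb-paywall` overlay before changing selectors.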
6. Extracting the funding rounds section
The funding rounds section is rendered by an Angular component that loads asynchronously after the initial page shell. Use `mode: "js_rendering"` with `js_wait_selector` pointing to `section#funding-rounds` so OmniScrape waits for the component to hydrate before extracting. Set `js_wait_timeout` to at least 10 seconds (the request below uses 12000, which reads as milliseconds), because Crunchbase's Angular bootstrap is slow on cold loads.
Round rows that are not paywalled will contain the funding date, round type label, and a link to the round detail page. The amount and lead investor name may be empty strings if the session is not authenticated to Pro. Extract investor links by targeting anchor tags with `href` containing `/organization/` inside the funding section — these point to investor organization pages you can follow-scrape.
Store the round detail URL (e.g., `/funding_round/stripe-series-h--abc123`) as a stable foreign key. Re-scraping the parent organization page will re-surface the same round — the round URL is your deduplication handle.
{
  "url": "https://www.crunchbase.com/organization/anthropic",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "enable_solver": true,
  "proxy": "residential:us",
  "js_wait_selector": "section#funding-rounds",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "round_dates": "section#funding-rounds span.field-type-date",
    "round_types": "section#funding-rounds a[href*='funding_round']",
    "round_amounts": "section#funding-rounds span.field-type-money",
    "investors": "section#funding-rounds a[href*='/organization/']",
    "total_funding": "span[data-test='funding-total']"
  }
}
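Using the round URL as the deduplication handle can be sketched as an upsert keyed on that URL. The example assumes each scraped round has already been assembled into a dict with a `round_url` key. That row shape and the `merge_rounds` name are assumptions, not OmniScrape output.

```python
def merge_rounds(stored: dict, scraped: list) -> dict:
    """Upsert scraped rounds into a copy of `stored`, keyed by round URL.

    Empty values in a new scrape (for example a paywalled amount) never
    overwrite a value that was captured earlier.
    """
    merged = {url: dict(row) for url, row in stored.items()}
    for row in scraped:
        current = merged.setdefault(row["round_url"], {})
        for field_name, value in row.items():
            if value or field_name not in current:
                current[field_name] = value
    return merged
```

Keeping earlier non-empty values is a design choice: a paywalled re-scrape returns an empty amount, and treating that as an update would erase data you already hold.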
7. Crunchbase Enterprise API and licensed data access
Crunchbase sells licensed API access and bulk data exports through its Enterprise tier. If you are building a product that surfaces Crunchbase funding data to end users — a CRM enrichment tool, an investor intelligence platform, a sales prospecting product — you almost certainly need a license rather than a scraper. Scraping free public fields and reselling compiled funding datasets competes directly with Crunchbase's core business and carries significant legal exposure.
The Enterprise API returns structured JSON with full funding detail, investor relationships, and historical round data. It is rate-limited but documented, and the data model is stable compared to CSS selectors that break whenever Crunchbase ships an Angular component update. For internal research use cases — a VC analyst running one-off lookups, a journalist verifying a funding claim — scraping publicly visible fields with counsel sign-off is a different risk profile than a commercial data product.
Evaluate the build-versus-buy decision honestly: the engineering cost of maintaining Crunchbase CSS selectors against frequent DOM changes, plus residential proxy costs, plus legal review, often exceeds the Enterprise API cost for production workloads.
8. Using permalinks and UUIDs as primary keys
Store the organization permalink slug — the human-readable portion of the URL like `openai` or `anthropic` — as your primary key for company records. This slug is stable across most rebrands and is the canonical identifier Crunchbase uses in all cross-links between organizations, funding rounds, and people.
For higher durability, extract the UUID from the embedded JSON in the page source. Crunchbase embeds a JSON blob in a `<script>` tag containing the organization's UUID, which persists even if the slug changes after an acquisition or rebrand. Parse this with a regex or JSON path extractor from the raw HTML response (`body.data.content`) before running CSS extraction.
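A regex pass over the raw HTML might look like the sketch below. It assumes the embedded JSON exposes the identifier under a key literally named `"uuid"` in the canonical 8-4-4-4-12 hex form. That key name is an assumption to verify against a real page, and `extract_uuid` is a hypothetical helper.

```python
import re
from typing import Optional

# Canonical 8-4-4-4-12 hex UUID following a "uuid" key. The key name is an
# assumption about the embedded JSON, not a documented Crunchbase field.
UUID_RE = re.compile(
    r'"uuid"\s*:\s*"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"',
    re.IGNORECASE,
)


def extract_uuid(raw_html: str) -> Optional[str]:
    """Return the first UUID found in the raw HTML, lowercased, or None."""
    match = UUID_RE.search(raw_html)
    return match.group(1).lower() if match else None
```

The first match is not guaranteed to belong to the organization itself if the blob also lists investors or people, so check the surrounding JSON structure before trusting it as a key.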
Model funding events as separate rows keyed by the round URL slug. A single organization scrape may surface multiple rounds — store each as an independent record with the parent organization permalink as a foreign key. This lets you incrementally update round records without re-processing the full organization history on every scrape cycle.
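The keying scheme in this section maps onto two tables. The SQLite sketch below is one possible layout. Table and column names are illustrative, and the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS organizations (
    permalink  TEXT PRIMARY KEY,   -- slug from the organization URL
    uuid       TEXT UNIQUE,        -- secondary identifier, may be NULL
    name       TEXT NOT NULL,
    scraped_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS funding_rounds (
    round_slug    TEXT PRIMARY KEY,  -- slug from the round URL
    org_permalink TEXT NOT NULL REFERENCES organizations(permalink),
    round_type    TEXT,
    announced_on  TEXT,
    amount        TEXT,              -- NULL when paywalled
    scraped_at    TEXT NOT NULL
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database with foreign keys enforced and the schema created."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def upsert_round(conn, round_slug, org_permalink, round_type, announced_on, amount, scraped_at):
    """Insert a round or refresh it in place. An empty amount keeps the stored one."""
    conn.execute(
        """
        INSERT INTO funding_rounds VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT(round_slug) DO UPDATE SET
            round_type   = excluded.round_type,
            announced_on = excluded.announced_on,
            amount       = COALESCE(excluded.amount, funding_rounds.amount),
            scraped_at   = excluded.scraped_at
        """,
        (round_slug, org_permalink, round_type, announced_on, amount or None, scraped_at),
    )
```

Because each round is its own row, a re-scrape touches only the rounds that appear on the page and leaves the rest of the organization's history alone.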
9. Crunchbase Terms of Service and legal considerations
Crunchbase's Terms of Service explicitly prohibit automated scraping, crawling, and data collection. Section 4 of their Terms restricts use of robots, spiders, or automated tools to access the service. Paywall bypass — whether by injecting Pro session cookies, intercepting API calls, or circumventing the `cb-paywall` overlay — constitutes unauthorized access to paid content and may violate the Computer Fraud and Abuse Act in the US and equivalent statutes in other jurisdictions.
OmniScrape provides the technical capability to make HTTP and browser-rendered requests to publicly accessible URLs. It does not grant any rights to the data returned by those requests. The legality of collecting, storing, and using Crunchbase data depends on your jurisdiction, your use case, and whether the data is publicly visible without authentication. Get legal counsel before building a commercial product on scraped Crunchbase data.
For publicly visible fields collected at low volume for internal research — verifying a funding claim, enriching a small prospect list — the risk profile is different from bulk collection and redistribution. Document your use case, respect robots.txt, use rate limiting, and do not attempt to access paywalled content.
Frequently asked questions
Why does my Crunchbase scraper return empty funding amounts?
Funding amounts on Crunchbase are paywalled behind Pro for most organizations on free views. The DOM node exists but its text content is replaced with a blur overlay, so your CSS selector is correct and the value is intentionally hidden. You will see the same empty result whether you load the page in a regular browser or a headless one. The only legitimate way to access the full amount is through a Pro account or the Enterprise API.
Do I need `js_rendering` mode for Crunchbase organization pages?
It depends on which fields you need. Basic firmographic fields (name, description, location, categories, employee range) are often present in the initial server-rendered HTML shell and can be extracted with `mode: "auto"` without full JS execution. The funding rounds section, people cards, and acquisition history require Angular hydration and need `mode: "js_rendering"` with `js_wait_selector` set to the relevant section ID. Use `auto` first and check what comes back before defaulting to `js_rendering` for every request.
Can I scrape Crunchbase discover search results?
Discover search requires an active login session and is heavily rate-limited even for authenticated Pro users. Automated access to discover search and CSV exports is explicitly restricted by Crunchbase Terms. The practical alternative is to build a seed list of organization permalinks from external sources — press releases, news mentions, LinkedIn company pages — and scrape each permalink directly rather than trying to replicate discover search programmatically.
How often do Crunchbase CSS selectors break?
Frequently. Crunchbase ships Angular component updates that change class names and DOM structure without notice. Selectors like `h1.profile-name` and `section#funding-rounds` have been relatively stable, but attribute-based selectors and deeply nested class chains break regularly. Build your pipeline with fallback selectors, monitor extraction success rates, and alert on empty results that were previously populated. Expect to update selectors several times per year.
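Fallback selectors and success-rate monitoring can be sketched as two small helpers. Both assume the extraction result is a dict keyed by selector name, with each fallback selector submitted under its own key. The function names are hypothetical, and any alert threshold is for you to choose.

```python
def first_non_empty(extracted: dict, candidates: list) -> tuple:
    """Return (selector_key, value) for the first candidate with a non-empty result.

    Returns (None, None) when every candidate is empty or missing.
    """
    for key in candidates:
        value = extracted.get(key)
        if value:
            return key, value
    return None, None


def fill_rate(results: list, field_name: str) -> float:
    """Fraction of scrape results in which `field_name` came back non-empty."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get(field_name)) / len(results)
```

Recording which key `first_non_empty` returned shows when the primary selector has stopped matching, and a sudden drop in `fill_rate` for a field that is not paywalled is the signal to review selectors.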
Is Crunchbase data public domain?
No. Crunchbase aggregates, cleans, and licenses funding data. Even if individual data points like a funding announcement are public facts, Crunchbase's compiled database may be protected as a compilation under copyright law and, in the EU, under database rights, and access to it is also governed by Crunchbase's Terms. Scraping and redistributing Crunchbase data commercially (as part of a data product, API, or enrichment service) carries high legal risk regardless of whether the underlying facts are public.
What proxy type should I use for Crunchbase?
Residential US proxies. Crunchbase rate-limits datacenter IP ranges aggressively on organization page requests: you will see 429 responses or silent redirects within a small number of sequential requests from a datacenter IP. Residential proxies rotate through real ISP addresses and significantly reduce rate-limiting friction. Set `"proxy": "residential:us"` in your OmniScrape request and keep request cadence low, with one request per organization every few seconds rather than parallel bursts.
How should I model Crunchbase data in my database?
Use the organization permalink slug as the primary key for company records. Store funding rounds as separate rows keyed by the round URL slug with the organization permalink as a foreign key. Extract and store the UUID from the embedded page JSON as a secondary identifier, since it survives slug changes after acquisitions or rebrands. Track a `scraped_at` timestamp on every record so you can identify stale data and prioritize re-scrape cycles for high-value organizations.