Yelp Scraper: Extract Business Listings, Ratings, and Reviews

1. Data fields local SEO and reputation teams extract from Yelp

Citation monitoring is fundamentally a consistency problem: do the business name, address, and phone number (NAP) on Yelp match what appears on Google Business Profile, Apple Maps, and a dozen other directories? Any mismatch is a local SEO signal worth flagging. Beyond NAP, reputation tools care about rating trends: not just the current average, but how it has moved over the last 90 days and whether negative reviews cluster around a specific product or location.

The fields below represent the full extraction target for a typical Yelp biz page. Not all fields are visible in the initial HTML — hours and amenities sometimes require interacting with expandable sections, and review text beyond the first page requires paginated requests.

Business name, Yelp biz ID, and /biz/ slug (stable identifier)
Star rating (aggregate average, 1–5 in 0.5 increments) and total review count
Street address, neighborhood label, city, state, ZIP code
Phone number (formatted and raw) and external website URL
Hours of operation for each day of the week, including holiday hours if present
Price range indicator ($ through $$$$)
Primary and secondary categories (e.g., 'Bakeries', 'Coffee & Tea')
Amenities and attributes (outdoor seating, reservations, wheelchair accessible, etc.)
Individual reviews: full text, star rating, date posted, reviewer username and profile URL
Owner response: presence, text, and response date
Claimed vs. unclaimed business status
Photos count and first-page photo URLs

2. Yelp URL patterns and pagination mechanics

Yelp's URL structure is stable and human-readable, which makes it predictable for crawlers. The primary biz slug is derived from the business name and city, lowercased and hyphenated. When two businesses share the same derived slug, Yelp appends a numeric suffix (-2, -3, etc.). The slug does not change when the business updates its name in the Yelp dashboard — the original slug persists, which is useful for long-term tracking.

Review pagination uses a simple offset query parameter, start, rather than cursor tokens, so you can construct any page URL without first fetching the previous page. Each page returns 10 reviews. To paginate, increment start by 10 until the review container comes back empty. The sort_by parameter controls ordering: date_desc returns reviews in reverse chronological order (newest first), which is most useful for incremental scraping where you only want reviews newer than your last crawl.

The biz ID — a numeric or alphanumeric identifier used in Yelp's internal systems — is embedded as a data attribute on the page and is useful when cross-referencing with the Yelp Fusion API.

Biz page: https://www.yelp.com/biz/dumpling-home-san-francisco
Duplicate slug: https://www.yelp.com/biz/dumpling-home-san-francisco-2
Reviews sorted by date: https://www.yelp.com/biz/dumpling-home-san-francisco?sort_by=date_desc
Review page 2 (offset 10): https://www.yelp.com/biz/dumpling-home-san-francisco?sort_by=date_desc&start=10
Review page 3 (offset 20): https://www.yelp.com/biz/dumpling-home-san-francisco?sort_by=date_desc&start=20
Local search: https://www.yelp.com/search?find_desc=pizza&find_loc=Brooklyn%2C+NY
Search with category filter: https://www.yelp.com/search?find_desc=Bakeries&find_loc=San+Francisco%2C+CA&cflt=bakeries
Biz ID exposed in data-biz-id attribute on the root biz container element
UK catalog: https://www.yelp.co.uk/biz/... (separate index, different review pools)
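The URL patterns above can be generated rather than hand-written. The following is a minimal sketch: the function name review_page_url and its parameters are illustrative, and it assumes the sort_by and start parameters behave as described in this section.

```python
from urllib.parse import urlencode


def review_page_url(slug: str, page: int, host: str = "www.yelp.com",
                    sort_by: str = "date_desc", page_size: int = 10) -> str:
    """Build the biz-page URL for a zero-based review page index."""
    if page < 0:
        raise ValueError("page must be >= 0")
    params = {"sort_by": sort_by}
    offset = page * page_size
    if offset:
        # The first page carries no start parameter in the patterns above.
        params["start"] = offset
    return f"https://{host}/biz/{slug}?{urlencode(params)}"


print(review_page_url("dumpling-home-san-francisco", 1))
```

Passing host="www.yelp.co.uk" targets the UK catalog with the same slug and offset logic.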

3. Yelp biz page DOM structure and CSS selectors

Yelp's frontend is a React application. The initial server-rendered HTML includes the business name, aggregate rating, address, phone, and the first 10 reviews — enough for basic extraction without JavaScript execution. However, Yelp periodically renames its CSS classes and data-testid attributes during frontend deploys, so selectors that work today may break within weeks. Anchoring selectors to data-testid attributes is more stable than class-based selectors, since testid values tend to change less frequently than generated class names.

The aggregate star rating is rendered as a div with data-testid="rating" containing an aria-label like '4 star rating' — parse the numeric value from the aria-label rather than trying to count star SVG elements. The review count appears in an anchor tag whose href contains the fragment #reviews. Address fields are split across multiple elements inside a container with data-testid="address"; concatenate the child text nodes to reconstruct the full address string.
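Parsing the rating from the aria-label can be sketched as below. The helper name parse_star_rating is hypothetical, and the regular expression assumes labels shaped like '4 star rating' or '4.5 star rating'; adjust it if the live markup differs.

```python
import re
from typing import Optional

# Matches "4 star rating" or "4.5 star rating" (assumed label shapes).
_RATING_RE = re.compile(r"(\d+(?:\.\d+)?)\s*star rating", re.IGNORECASE)


def parse_star_rating(aria_label: Optional[str]) -> Optional[float]:
    """Return the numeric rating from an aria-label, or None if absent."""
    match = _RATING_RE.search(aria_label or "")
    if not match:
        return None
    value = float(match.group(1))
    # Reject values outside the star scale instead of storing garbage.
    return value if 0.0 <= value <= 5.0 else None


print(parse_star_rating("4.5 star rating"))
```

Returning None rather than raising lets the caller treat a missing rating as a selector-health signal instead of a crash.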

Individual reviews are rendered as li elements inside an ordered list. Each review li contains a div with data-testid="review". Within that container: review text is in a p element with a lang attribute (e.g., lang="en"); the star rating is in a div with aria-label containing 'star rating'; the date is in a span with a generated class — look for a span whose text matches a date pattern rather than relying on class names. The reviewer's profile link is an anchor with href containing /user_details.
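Finding the review date by pattern instead of by class name might look like the sketch below. The two date formats are assumptions about how Yelp renders dates, not confirmed values, and the helper names are illustrative; the input is the list of text values from every span inside one review container.

```python
from datetime import date, datetime
from typing import Iterable, Optional

# Assumed display formats, e.g. "Jan 5, 2024" or "1/5/2024".
_DATE_FORMATS = ("%b %d, %Y", "%m/%d/%Y")


def parse_review_date(text: str) -> Optional[date]:
    """Return a date if the text matches a known format, else None."""
    cleaned = text.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def first_date(span_texts: Iterable[str]) -> Optional[date]:
    """Scan span texts from one review and return the first parseable date."""
    for text in span_texts:
        parsed = parse_review_date(text)
        if parsed is not None:
            return parsed
    return None


print(first_date(["Elite 24", "Jan 5, 2024", "3 photos"]))
```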

Pagination controls render as anchor tags. The 'next' page link contains rel="next" or can be constructed directly from the start= offset. When the start= value exceeds the total review count, the review list renders empty — use that as your termination condition.
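The empty-list termination condition can be sketched as a generator. Here fetch_page is a hypothetical callable that takes a start offset and returns the parsed reviews for that page; how it issues the request is left out. The max_pages guard is an added safety assumption so a selector bug cannot cause an endless crawl.

```python
from typing import Callable, Iterator, List


def iter_review_pages(fetch_page: Callable[[int], List[dict]],
                      page_size: int = 10,
                      max_pages: int = 500) -> Iterator[List[dict]]:
    """Yield pages of reviews until a page comes back empty."""
    for page in range(max_pages):
        reviews = fetch_page(page * page_size)
        if not reviews:
            # Offset has passed the total review count: stop paginating.
            return
        yield reviews


# Usage with an in-memory stand-in for the real fetcher.
fake_reviews = [{"id": i} for i in range(23)]
pages = list(iter_review_pages(lambda start: fake_reviews[start:start + 10]))
print([len(p) for p in pages])
```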

4. Yelp bot detection and anti-scraping measures

Yelp runs a multi-layer bot detection stack. Datacenter IP ranges — AWS, GCP, Azure, DigitalOcean — are blocked or served degraded responses almost immediately. Search result pages are the most aggressively protected; even moderate request rates from residential IPs trigger CAPTCHA challenges on search. Biz pages are somewhat more permissive, but burst patterns (many requests in a short window) still trigger 403 responses or CAPTCHA interstitials.

Review pagination beyond the first page requires JavaScript execution in many cases — the review list container is present in the initial HTML but populated via an XHR call triggered after page load. If your css_extractor request returns an empty review list, switch to js_rendering mode with js_wait_selector targeting the review container. Set js_wait_timeout to at least 10–12 seconds to account for Yelp's API response latency.

Yelp's data-testid attribute names change with frontend deploys, typically every few weeks. Build selector validation into your pipeline: if the expected selector returns null for a known business, trigger an alert and re-inspect the DOM rather than silently writing empty fields to your database.
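A selector health check along these lines can be sketched as follows. The sentinel field names and function names are illustrative; the input is assumed to be a mapping from a known business slug to the dictionary of fields your extractor returned for it.

```python
from typing import Any, Dict, List, Mapping, Sequence

# Fields that should never be empty for a known, live business (assumed set).
SENTINEL_FIELDS = ("name", "rating", "review_count")


def _is_empty(value: Any) -> bool:
    """Treat None, blank strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def missing_fields(extracted: Mapping[str, Any],
                   required: Sequence[str] = SENTINEL_FIELDS) -> List[str]:
    """List the required fields that came back empty."""
    return [name for name in required if _is_empty(extracted.get(name))]


def failing_businesses(results: Mapping[str, Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each known business slug to its missing fields, omitting healthy ones."""
    report = {slug: missing_fields(fields) for slug, fields in results.items()}
    return {slug: missing for slug, missing in report.items() if missing}
```

A non-empty result from failing_businesses is the point at which to alert and pause writes rather than store blank fields.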

Geographically, Yelp operates separate catalogs for different countries (yelp.com, yelp.co.uk, yelp.com.au, etc.). A business listed on yelp.co.uk will not appear in yelp.com search results. Match your proxy geography to the catalog you are targeting.

Datacenter IPs blocked or rate-limited on most page types
CAPTCHA on search pages and high-volume biz page requests — use enable_solver: true
JS-rendered review pagination requires js_rendering mode for pages beyond the first
data-testid attribute names change with frontend deploys — monitor selector health
Geo-partitioned catalogs: yelp.com, yelp.co.uk, yelp.com.au are separate indexes
TLS fingerprinting and browser behavior signals used for bot classification
ToS Section 7 explicitly prohibits scraping reviews for competing directory products

5. Scrape a Yelp business page with CSS extraction

For a single biz page, mode 'auto' with a residential US proxy is the right starting point. The initial server-rendered HTML contains the business name, rating, address, phone, website, price range, and categories — all extractable without JavaScript execution. OmniScrape's auto mode tries a fast HTTP request first and escalates to headless browser rendering only if the response indicates a challenge or missing content, which keeps costs low for pages that serve full HTML.

Use css_extractor output format with explicit selectors for each field. The response will include a css_extracted map with your field names as keys. Check body.data.css_extracted in the response — if a field is null or empty, the selector may have changed and needs updating. The metadata.method_used field tells you whether the request was served via fast HTTP or js_rendering, which helps you understand Yelp's current response behavior for that URL.

Residential proxy geo-matching matters: a US proxy in the same metro as the business tends to get more complete results, particularly for hours and attributes that Yelp may personalize by region.

Yelp biz page — CSS extraction request

json

{
  "url": "https://www.yelp.com/biz/tartine-bakery-san-francisco",
  "mode": "auto",
  "output_format": "css_extractor",
  "enable_solver": true,
  "proxy": "residential:us",
  "css_selectors": {
    "name": "h1",
    "rating": "[data-testid=\"rating\"]",
    "review_count": "a[href*=\"#reviews\"]",
    "address": "[data-testid=\"address\"]",
    "phone": "[data-testid=\"phone-number\"]",
    "website": "[data-testid=\"biz-website-link\"]",
    "price_range": "[data-testid=\"price-range\"]",
    "categories": "span[class*=\"category-str-list\"]",
    "hours_table": "table[class*=\"hours-table\"]",
    "claimed_status": "[data-testid=\"claimed-status\"]"
  }
}

6. Paginate and extract Yelp reviews

Review pagination uses the start= query parameter with increments of 10. For incremental scraping (only new reviews since last run), use sort_by=date_desc and stop pagination when you encounter a review date older than your last crawl timestamp — this avoids fetching the full review history on every run.
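The date cutoff for incremental runs can be sketched as a small filter applied to each page. It assumes each review is a dictionary with a date key holding a datetime.date and that pages arrive newest first; the function name is illustrative. Reviews dated exactly on the last crawl day are kept, so deduplicate downstream.

```python
from datetime import date
from typing import Iterable, List, Tuple


def take_new_reviews(reviews: Iterable[dict],
                     last_crawl: date) -> Tuple[List[dict], bool]:
    """Return (reviews not older than last_crawl, whether the cutoff was hit)."""
    fresh: List[dict] = []
    for review in reviews:
        if review["date"] < last_crawl:
            # Newest-first ordering means everything after this is older too.
            return fresh, True
        fresh.append(review)
    return fresh, False


page = [
    {"id": "a", "date": date(2024, 3, 3)},
    {"id": "b", "date": date(2024, 3, 1)},
    {"id": "c", "date": date(2024, 2, 20)},
]
new, reached_cutoff = take_new_reviews(page, last_crawl=date(2024, 3, 1))
print([r["id"] for r in new], reached_cutoff)
```

When the second return value is True, stop requesting further pages for that business.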

Reviews beyond the first page often require JavaScript execution, so treat js_rendering as the default for paginated review requests. Use js_rendering mode with js_wait_selector set to the review container. If the selector does not appear within js_wait_timeout milliseconds, OmniScrape returns whatever HTML was available, so check that css_extracted.review_text is non-empty before writing to your store.

Space requests at least 3–5 seconds apart per business. For bulk crawls across many businesses, distribute requests across multiple sessions using session_id to avoid pattern detection from a single IP making sequential requests to the same domain.
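One way to plan the pacing and session spread is sketched below. The function name, the round-robin session assignment, and the uniform jitter are assumptions for illustration; only the 3 to 5 second spacing and the session_id parameter come from the text above. The caller is expected to sleep for the yielded delay before sending each request.

```python
import itertools
import random
from typing import Iterable, Iterator, Optional, Sequence, Tuple


def plan_requests(urls: Iterable[str],
                  session_ids: Sequence[str],
                  min_delay: float = 3.0,
                  max_delay: float = 5.0,
                  rng: Optional[random.Random] = None) -> Iterator[Tuple[str, str, float]]:
    """Yield (url, session_id, delay_seconds) with round-robin sessions."""
    if not session_ids:
        raise ValueError("at least one session_id is required")
    rng = rng or random.Random()
    sessions = itertools.cycle(session_ids)
    for url in urls:
        # Jitter inside the window avoids a fixed, detectable cadence.
        yield url, next(sessions), rng.uniform(min_delay, max_delay)


for url, session_id, delay in plan_requests(["u1", "u2", "u3"], ["s1", "s2"]):
    print(url, session_id, round(delay, 2))
```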

Yelp reviews — paginated JS-rendered request

json

{
  "url": "https://www.yelp.com/biz/tartine-bakery-san-francisco?sort_by=date_desc&start=10",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "enable_solver": true,
  "proxy": "residential:us",
  "js_wait_selector": "[data-testid=\"review\"]",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "review_text": "[data-testid=\"review\"] p[lang]",
    "review_rating": "[data-testid=\"review\"] div[aria-label*=\"star rating\"]",
    "review_date": "[data-testid=\"review\"] span[class*=\"date\"]",
    "reviewer_name": "[data-testid=\"review\"] a[href*=\"/user_details\"]",
    "owner_response": "[data-testid=\"owner-response\"]"
  }
}

7. NAP normalization, deduplication, and cross-source merging

NAP (Name, Address, Phone) reconciliation is the core use case for multi-source local data pipelines. Raw Yelp data arrives inconsistently formatted: phone numbers may be '(415) 555-0100', '+14155550100', or '415.555.0100' depending on what the business owner entered. Normalize all phone numbers to E.164 format (+14155550100) before storage and comparison. For addresses, use a parser like libpostal to decompose free-text addresses into structured components (street number, street name, city, state, postal code) — this enables reliable matching even when abbreviations differ ('St' vs 'Street', 'Ave' vs 'Avenue').
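Phone normalization for the three example formats can be sketched with the standard library alone. This version is deliberately limited to US and Canadian numbers (country code 1) and does not validate area codes; for international catalogs, a dedicated library such as phonenumbers is likely a better fit. The function name is illustrative.

```python
import re
from typing import Optional


def to_e164_us(raw: Optional[str]) -> Optional[str]:
    """Normalize a US-style phone string to E.164, or None if not 10 digits."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        return None
    return "+1" + digits


for sample in ("(415) 555-0100", "+14155550100", "415.555.0100"):
    print(to_e164_us(sample))
```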

Store the Yelp biz slug as the primary key for Yelp records, not the business name — names change, slugs persist. When merging Yelp records with Google Maps data, use a two-stage match: first attempt an exact match on normalized phone + ZIP code, then fall back to fuzzy name matching (Jaro-Winkler or trigram similarity) within the same ZIP code. Require a similarity threshold of at least 0.85 before auto-merging; queue lower-confidence matches for manual review.
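The two-stage match can be sketched as below, using trigram similarity (Jaccard overlap of character trigrams) for the fuzzy stage. The record keys (name, phone, zip), the outcome labels, and the review_floor of 0.5 below which a pair is discarded rather than queued are all assumptions for illustration; only the 0.85 auto-merge threshold comes from the text.

```python
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Character trigrams of a lowercased, whitespace-collapsed string."""
    normalized = " ".join(text.lower().split())
    if not normalized:
        return set()
    padded = f"  {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_records(yelp: dict, google: dict,
                  threshold: float = 0.85, review_floor: float = 0.5) -> str:
    """Return 'exact', 'fuzzy', 'review', or 'no_match' for a candidate pair."""
    if yelp["zip"] != google["zip"]:
        return "no_match"  # both stages are scoped to the same ZIP code
    if yelp["phone"] and yelp["phone"] == google["phone"]:
        return "exact"     # stage 1: normalized phone + ZIP
    score = trigram_similarity(yelp["name"], google["name"])
    if score >= threshold:
        return "fuzzy"     # stage 2: confident name match, safe to auto-merge
    return "review" if score >= review_floor else "no_match"
```

Pairs labeled 'review' go to the manual queue; 'exact' and 'fuzzy' can be merged automatically.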

Track NAP discrepancies as structured diffs: {field: 'phone', yelp: '+14155550100', google: '+14155550199', business_id: '...'}. This format makes it easy to generate citation audit reports and to detect when a business has updated its information on one platform but not others.
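Producing diffs in that shape can be sketched as follows. It assumes both records were normalized beforehand and use the same keys (name, address, phone); the function name is illustrative.

```python
from typing import Dict, List, Mapping, Optional

NAP_FIELDS = ("name", "address", "phone")


def nap_diffs(business_id: str,
              yelp: Mapping[str, Optional[str]],
              google: Mapping[str, Optional[str]]) -> List[Dict[str, Optional[str]]]:
    """Return one structured diff per NAP field that disagrees."""
    diffs = []
    for field in NAP_FIELDS:
        yelp_value, google_value = yelp.get(field), google.get(field)
        if yelp_value != google_value:
            diffs.append({
                "field": field,
                "yelp": yelp_value,
                "google": google_value,
                "business_id": business_id,
            })
    return diffs
```

An empty list means the two listings agree on every NAP field.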

See lead generation web scraping for broader enrichment pipeline patterns that apply to local business data.

8. Yelp Fusion API — when to use it instead of scraping

Yelp operates a first-party API called Fusion API, available at api.yelp.com. The free tier provides access to business search, business details, and reviews (capped at 3 reviews per business on the public tier). Fusion is the correct choice when you are building a consumer-facing product that displays Yelp data with attribution, or when your use case falls within Yelp's developer terms — the API provides structured JSON responses with stable field names and no bot-detection friction.

Fusion's limitations are meaningful for data-intensive use cases: the free tier has rate limits that make bulk extraction impractical, review text is truncated, and access to full review history is not available. The API also does not expose some fields visible on the biz page, such as detailed amenity attributes and owner responses. For internal analytics, citation monitoring of your own business listings, or research use cases that do not involve republishing Yelp content, scraping the biz page gives you more complete data.

The two approaches are not mutually exclusive. A common pattern is to use Fusion for initial business discovery and structured metadata (categories, coordinates, Yelp rating), then scrape the biz page for fields Fusion does not expose (full review text, owner responses, amenity details). Always check the current Fusion API terms before combining approaches — Yelp updates its developer policies periodically.
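The discovery step through Fusion can be sketched by building a request for the v3 business search endpoint with Bearer authentication. The function name and default limit are illustrative, and the sketch only constructs the request; sending it requires a valid API key and is subject to the developer terms mentioned above.

```python
from urllib.parse import urlencode
from urllib.request import Request

FUSION_BASE = "https://api.yelp.com/v3"


def fusion_search_request(api_key: str, term: str, location: str,
                          limit: int = 20) -> Request:
    """Build (but do not send) a Fusion business search request."""
    query = urlencode({"term": term, "location": location, "limit": limit})
    return Request(
        f"{FUSION_BASE}/businesses/search?{query}",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
    )


request = fusion_search_request("YOUR_API_KEY", "bakeries", "San Francisco, CA")
print(request.full_url)
```

Passing the request to urllib.request.urlopen returns the JSON response; each business entry is expected to include an alias that corresponds to the /biz/ slug, which gives you the URL to scrape for the fields Fusion omits.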

9. Yelp Terms of Service and legal considerations

Yelp's Terms of Service, Section 7 (Prohibited Activities), explicitly prohibits using automated means to access the site, scraping content, or using Yelp data to build competing products or populate other directories. Yelp has a history of litigation against scrapers. hiQ Labs v. LinkedIn is often cited in this context, but that case involved LinkedIn, not Yelp, and its outcome does not provide blanket protection for scraping Yelp.

The compliance picture depends heavily on use case. Monitoring your own business's Yelp listing for citation accuracy or review alerts is common practice and carries low legal risk, though it is technically still outside the letter of the ToS. Building a competing local directory populated with Yelp reviews is the highest-risk use case and the one Yelp has historically pursued legally. Academic research and journalism occupy a grayer middle ground.

Do not republish Yelp review text verbatim in customer-facing products without explicit permission. If you are building a product that displays local business data, evaluate the Yelp Fusion API with proper attribution as the compliant path. For internal analytics where data is not republished, assess your specific use case with legal counsel familiar with the CFAA and relevant state computer fraud statutes.

Frequently asked questions

How do I paginate through all Yelp reviews for a business?

Use the start= query parameter in increments of 10: ?sort_by=date_desc&start=0, then start=10, start=20, and so on. Stop when the review container in the response is empty — this means you have exceeded the total review count. For incremental scraping, sort by date_desc and halt pagination when you encounter a review older than your last crawl timestamp, rather than always fetching from the beginning. Space requests at least 3–5 seconds apart to avoid triggering rate limits.

Why are reviews missing from my Yelp scrape response?

Review content beyond the first page is loaded via JavaScript after initial page render. If you are using mode 'auto' or 'fast' and the review list is empty, switch to mode 'js_rendering' with js_wait_selector set to '[data-testid="review"]' and js_wait_timeout of at least 10000–12000ms. Also verify that your css_selectors are current — Yelp's data-testid attribute names change with frontend deploys, so selectors that worked last month may be stale.

What proxy type should I use for Yelp?

Residential proxies are required for reliable Yelp access. Datacenter IP ranges (AWS, GCP, DigitalOcean) are blocked or served CAPTCHA challenges almost immediately. Use proxy: 'residential:us' for US Yelp listings. For geo-specific catalogs (yelp.co.uk, yelp.com.au), match the proxy country to the catalog — 'residential:gb' for UK, 'residential:au' for Australia. Metro-level geo-matching (e.g., a California residential IP for San Francisco businesses) can improve response completeness for localized content.

How do I handle Yelp CAPTCHA challenges?

Set enable_solver: true in your OmniScrape request. OmniScrape's Web Unlocker handles CAPTCHA solving automatically. Check metadata.solver_used and metadata.challenge_solved in the response to confirm the challenge was resolved. If you are seeing persistent CAPTCHA on search pages, consider targeting biz URLs directly rather than search — search pages have significantly more aggressive bot detection than individual biz pages.

How do I merge Yelp data with Google Maps data for NAP reconciliation?

Normalize both sources before merging: convert phone numbers to E.164 format and parse addresses with libpostal into structured components. Use a two-stage match: exact match on normalized phone + ZIP code first, then fall back to fuzzy name similarity (Jaro-Winkler, threshold 0.85+) within the same ZIP. Store the Yelp biz slug and Google Place ID as separate keys — do not use business name as a primary key since names change. See Google Maps scraper for the Maps-side extraction patterns.

Can I use Yelp Fusion API instead of scraping?

Fusion API is the right choice for consumer-facing products that display Yelp data with attribution, and for use cases within Yelp's developer terms. Its limitations: the free tier caps reviews at 3 per business, review text is truncated, and bulk extraction is rate-limited. Fusion does not expose amenity attributes, owner responses, or full review history. For internal analytics or citation monitoring of your own listings, scraping the biz page gives more complete data — but evaluate your use case against Yelp's developer terms before combining approaches.

How often do Yelp's CSS selectors and data-testid attributes change?

Yelp's frontend deploys frequently — data-testid attribute names and generated CSS classes can change every few weeks. Build selector health monitoring into your pipeline: after each crawl run, validate that key fields (name, rating, review_count) are non-null for a set of known test businesses. If any sentinel field returns null, trigger an alert and re-inspect the live DOM before the next production run. Anchoring to data-testid attributes is more stable than generated class names, but neither is immune to changes.
