Real Estate Web Scraping: Listings, Comps, and Market Data

1. End-to-end industry workflow

The standard proptech ingestion loop runs on two cadences. Daily: pull new listing IDs from portal XML sitemaps and saved-search feed URLs, enqueue detail page URLs, fetch each via OmniScrape, geocode the normalized address, run a PostGIS point-in-polygon filter against your target metro boundaries, and append the result to a time-series snapshot table. That table drives days-on-market counters and price-cut detection without any additional API calls.
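The ID-discovery step of the daily loop can be sketched with the standard library alone. This is a minimal illustration, assuming the portal publishes a standard sitemaps.org `<urlset>` document and that listing URLs contain `/listing/`; the function name and the `path_marker` parameter are hypothetical, not part of any portal or OmniScrape API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def listing_urls_from_sitemap(xml_text, path_marker="/listing/"):
    """Return de-duplicated listing URLs from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    seen = set()
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        # path_marker is an assumed URL convention; adjust per portal.
        if path_marker in url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://portal.example/listing/2048812</loc></url>
  <url><loc>https://portal.example/about</loc></url>
  <url><loc>https://portal.example/listing/2048812</loc></url>
</urlset>"""

print(listing_urls_from_sitemap(SAMPLE))
```

The returned list is what would be enqueued for detail-page fetches; non-listing pages and repeated URLs are dropped before they cost a request.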

Weekly (or on-demand): map-search pages that don't expose sitemaps require a browser pass. Use mode 'js_rendering' with a js_wait_selector targeting result cards. Extract listing hrefs from the rendered DOM, then feed those URLs back into the daily detail-page queue. This keeps the expensive browser renders to a minimum — sitemap-sourced URLs are always cheaper and more stable when available.

On status change: when a listing flips to off-market, sold, or rented, write the terminal snapshot and stop polling. Continuing to fetch sold listings wastes credits and inflates your active-inventory counts.
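One way to express the stop-polling rule is shown below. It is a sketch under assumptions: the status labels in `TERMINAL_STATUSES`, the in-memory list and set, and the function name are all hypothetical stand-ins for your own snapshot table and queue.

```python
# Assumed status labels; use whatever values your transform layer emits.
TERMINAL_STATUSES = {"off_market", "sold", "rented"}


def record_snapshot(row, snapshots, poll_queue):
    """Append the snapshot, then drop the listing from the poll queue if it is terminal."""
    snapshots.append(row)  # the terminal row itself is still written
    if row["status"] in TERMINAL_STATUSES:
        poll_queue.discard(row["portal_url"])


snapshots = []
poll_queue = {"https://portal.example/listing/1", "https://portal.example/listing/2"}
record_snapshot({"portal_url": "https://portal.example/listing/1", "status": "sold"}, snapshots, poll_queue)
record_snapshot({"portal_url": "https://portal.example/listing/2", "status": "active"}, snapshots, poll_queue)
```

After these two calls both snapshots are stored, but only the active listing remains in the queue.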

2. Listing data schema

Never use the display address string as a primary key — two portals will format the same address differently, and a unit number swap will create phantom duplicates. Normalize with libpostal or a vendor geocoder and store both the raw and normalized forms. The geocoder_confidence field gates rows before they reach your comp model: anything below 0.85 on a critical metro should be quarantined for manual review.

Store prior_price_usd alongside list_price_usd so price-cut queries are a single-row read rather than a self-join. photo_urls should be stored as references only — never hotlink portal CDN URLs in customer-facing applications without a licensing agreement, and expect them to 404 within days as CDN tokens expire.

The listing_type field ('sale' vs 'rental') must be set at extract time from URL path patterns or an on-page badge selector. Do not infer it from the presence of list_price_usd vs rent_monthly_usd — some portals show both on the same listing for rent-to-own products.

listing fact row (SCD Type 2 snapshot)

json

{
  "portal": "zillow_style_portal",
  "listing_id": "MLS-2048812",
  "address_raw": "142 Oak St, Austin, TX 78701",
  "address_normalized": "142 Oak Street, Austin, TX 78701",
  "lat": 30.2672,
  "lon": -97.7431,
  "geocoder_confidence": 0.97,
  "list_price_usd": 485000,
  "prior_price_usd": 499000,
  "rent_monthly_usd": null,
  "beds": 3,
  "baths": 2.5,
  "sqft": 1840,
  "lot_sqft": 5400,
  "year_built": 1998,
  "listing_type": "sale",
  "status": "active",
  "days_on_market": 14,
  "price_cut_count": 1,
  "first_seen_at": "2026-05-01T00:00:00Z",
  "status_changed_at": null,
  "scraped_at": "2026-06-23T07:00:00Z",
  "portal_url": "https://portal.example/listing/2048812",
  "photo_urls": ["https://cdn.portal.example/photos/2048812/1.jpg"]
}

3. OmniScrape API request examples

mode 'auto' tries a fast HTTP fetch first and escalates to a headless browser automatically if the portal returns a bot challenge or the CSS selectors come back empty. This means you pay browser-tier credits only on pages that genuinely need them — typically 20–40% of requests on major US portals, depending on your IP reputation and the portal's bot-detection aggressiveness.

enable_solver: true activates the Web Unlocker layer, which handles CAPTCHA challenges and JavaScript fingerprinting. Pair it with proxy: 'residential:us' to match the geo-market of the listing — portals frequently return geo-gated results or empty map tiles when they detect a mismatch between the IP location and the market being queried.

The extracted fields arrive in body.data.css_extracted as a key-value map matching your css_selectors keys. Numeric fields like price and sqft will be strings at this stage — strip currency symbols and commas before casting to int/float in your transform layer.
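The string-to-number cleanup in the transform layer might look like the following. It is a minimal sketch, assuming the extracted values are display strings such as "$485,000" or "1,840 sqft"; `parse_number` and `transform` are hypothetical helper names, and the selector keys match the request examples in this section.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(raw):
    """Return the first number in a display string, or None if there is none.

    Commas are removed first, so '$485,000' becomes 485000. Abbreviated
    values such as '$1.2M' are NOT expanded and would need extra handling.
    """
    if raw is None:
        return None
    match = _NUMBER.search(raw.replace(",", ""))
    if match is None:
        return None
    value = float(match.group())
    return int(value) if value.is_integer() else value


def transform(css_extracted):
    """Map the css_extracted key-value strings to typed schema fields."""
    return {
        "list_price_usd": parse_number(css_extracted.get("price")),
        "beds": parse_number(css_extracted.get("beds")),
        "baths": parse_number(css_extracted.get("baths")),
        "sqft": parse_number(css_extracted.get("sqft")),
    }


row = transform({"price": "$485,000", "beds": "3 bd", "baths": "2.5 ba", "sqft": "1,840 sqft"})
print(row)
```

Returning None for unparseable values (for example "Contact agent") keeps bad strings out of numeric columns and lets the extraction-success metric count them as misses.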

The two requests below cover both cases. Listing detail pages use mode 'auto' with css_extractor. Map search result pages, where listing cards render client-side, use mode 'js_rendering' directly and wait for the result container:

listing detail page — css_extractor

json

{
  "url": "https://portal.example/listing/2048812",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "price": "[data-testid='list-price']",
    "beds": "[data-testid='bed-count']",
    "baths": "[data-testid='bath-count']",
    "sqft": "[data-testid='sqft']",
    "address": "h1[data-testid='address']",
    "status": "[data-testid='listing-status']",
    "days_on_market": "[data-testid='dom-badge']",
    "listing_type": "[data-testid='listing-type-badge']"
  },
  "js_wait_selector": "[data-testid='list-price']",
  "js_wait_timeout": 12000
}

map search results — extract listing hrefs

json

{
  "url": "https://portal.example/search?city=Austin&type=sale&zoom=12",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "enable_solver": true,
  "css_selectors": {
    "listing_links": ".map-search-result-card a[href*='/listing/']"
  },
  "js_wait_selector": ".map-search-result-card",
  "js_wait_timeout": 15000
}

4. Pipeline architecture

A production real estate pipeline has five logical stages, each with a clear failure mode to instrument:

1. ID discovery — Portal XML sitemaps, saved-search RSS/JSON feeds, and map-search crawls produce a stream of listing URLs. Deduplicate by URL before enqueuing. Sitemaps are the preferred source: they're cheaper to fetch, more complete, and don't require browser rendering.

2. Detail extraction — OmniScrape fetches each listing URL with css_extractor. The response body.data.css_extracted map feeds directly into your transform layer. Log body.metadata.method_used to track how often 'auto' escalates to 'js_rendering' — a sudden spike indicates a portal has tightened its bot detection.

3. Geocode and filter — Normalize the raw address string, geocode to lat/lon, and run a PostGIS ST_Within check against your target metro polygons. Rows outside the polygon or below confidence threshold go to a quarantine table, not the main fact table. This keeps your comp models clean without discarding data permanently.

4. Snapshot append (SCD Type 2) — Every successful fetch writes a new row with scraped_at. Never update in place. This gives you a full price history and days-on-market calculation as a simple MAX(scraped_at) - MIN(first_seen_at) per listing_id. Partition the snapshot table by scraped_at month for cheap time-travel queries.

5. Downstream delivery — Price-cut alerts (prior_price_usd > list_price_usd on latest snapshot) fire to broker webhooks. Aggregated metrics roll up to a broker dashboard or CSV export. Photo URLs are stored as references and served via a signed proxy that checks CDN freshness before returning to the client.
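The routing decision in stage 3 can be illustrated without a database. The sketch below substitutes a pure-Python ray-casting test for the PostGIS ST_Within check, purely so the example is self-contained; in production the polygon test would run in PostGIS as described above. The function names, the rectangular sample polygon, and the "fact"/"quarantine" labels are assumptions.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def route_row(row, metro_polygon, min_confidence=0.85):
    """Return 'fact' for rows that pass both gates, otherwise 'quarantine'."""
    if row["geocoder_confidence"] < min_confidence:
        return "quarantine"
    if not point_in_polygon(row["lon"], row["lat"], metro_polygon):
        return "quarantine"
    return "fact"


# Hypothetical rectangular stand-in for a metro boundary.
AUSTIN_BOX = [(-98.0, 30.0), (-97.5, 30.0), (-97.5, 30.5), (-98.0, 30.5)]

print(route_row({"lat": 30.2672, "lon": -97.7431, "geocoder_confidence": 0.97}, AUSTIN_BOX))
```

Quarantined rows are kept rather than dropped, which matches the goal of cleaning the comp model without discarding data permanently.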

5. Deduplication and MLS conflicts

The same physical property will appear on multiple portals with different listing IDs, slightly different addresses, and sometimes conflicting data. A property listed by an agent on the MLS may surface on Zillow, Realtor.com, Homes.com, and a regional portal simultaneously — each with its own internal ID.

Primary dedup key: normalized_address + beds + sqft with a fuzzy match threshold (Levenshtein distance ≤ 2 on address, ±5% on sqft). Do not use listing_id alone, because it is portal-scoped. Do not use the raw address, because formatting diverges between portals. A canonical_id generated from the geocoded lat/lon rounded to 5 decimal places (roughly 1 m of precision) works well as a cross-portal merge key.
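A minimal sketch of both keys follows, assuming rows shaped like the listing fact row in section 2. The function names and the string format of `canonical_id` are hypothetical; the thresholds are the ones stated above.

```python
def canonical_id(lat, lon, places=5):
    """Build a merge key from coordinates rounded to `places` decimal places."""
    return f"{lat:.{places}f},{lon:.{places}f}"


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def same_property(a, b, max_edits=2, sqft_tolerance=0.05):
    """Fuzzy match: equal beds, sqft within tolerance, address within max_edits."""
    if a["beds"] != b["beds"]:
        return False
    if abs(a["sqft"] - b["sqft"]) > sqft_tolerance * max(a["sqft"], b["sqft"]):
        return False
    distance = levenshtein(a["address_normalized"].lower(), b["address_normalized"].lower())
    return distance <= max_edits


portal_a = {"address_normalized": "142 Oak Street, Austin, TX 78701", "beds": 3, "sqft": 1840}
portal_b = {"address_normalized": "142 Oak Street Austin, TX 78701", "beds": 3, "sqft": 1850}

print(canonical_id(30.2672, -97.7431), same_property(portal_a, portal_b))
```

Note that two neighbouring house numbers (142 vs 144) are also within an edit distance of 2, so a real matcher would likely compare the street number exactly and apply the fuzzy threshold only to the rest of the address.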

MLS licensing is a separate concern from technical deduplication. In the US, RESO-standard MLS feeds are licensed products. Many portals that display MLS data have agreements that prohibit downstream scraping or redistribution. Legal review before production is not optional — your counsel owns the permissible-use question, not your engineering team.

For rental portals (apartments, Craigslist-style), dedup is harder because landlords post the same unit repeatedly with new listing IDs. A composite key of (normalized_address, unit_number, beds, rent_monthly_usd) with a 30-day dedup window handles most cases.
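The rental composite key and 30-day window could be applied as in this sketch. It assumes each posting carries a `scraped_at` datetime and a `unit_number` field (which is not part of the sample fact row above), and it measures the window from the last posting that was kept; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def dedup_rentals(postings, window_days=30):
    """Keep one posting per composite key within each dedup window."""
    window = timedelta(days=window_days)
    last_kept = {}  # composite key -> scraped_at of the last kept posting
    kept = []
    for p in sorted(postings, key=lambda p: p["scraped_at"]):
        key = (p["address_normalized"], p["unit_number"], p["beds"], p["rent_monthly_usd"])
        previous = last_kept.get(key)
        if previous is None or p["scraped_at"] - previous > window:
            last_kept[key] = p["scraped_at"]
            kept.append(p)
    return kept


base = {"address_normalized": "900 Elm Street, Austin, TX 78702", "unit_number": "4B",
        "beds": 1, "rent_monthly_usd": 2100}
postings = [
    dict(base, listing_id="r-1", scraped_at=datetime(2026, 5, 1)),
    dict(base, listing_id="r-2", scraped_at=datetime(2026, 5, 11)),  # repost inside the window
    dict(base, listing_id="r-3", scraped_at=datetime(2026, 6, 15)),  # outside the window
]

print([p["listing_id"] for p in dedup_rentals(postings)])
```

Because rent is part of the key, a repost at a different price is treated as a new posting, which is usually what a rental price history wants.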

6. Sale vs rental classification

Mixing sale listings into a rental comp model — or vice versa — is one of the most common data quality failures in proptech pipelines. The damage is subtle: rent estimates drift upward because $485,000 sale prices get averaged with $2,100/month rents after a unit-conversion bug.

Set listing_type at extract time using two independent signals and require both to agree before writing the row. Signal one: URL path pattern (e.g., '/for-sale/' vs '/for-rent/' vs '/apartments/'). Signal two: on-page badge selector (e.g., '[data-testid=listing-type-badge]'). If they disagree, route the row to a classification review queue rather than guessing.
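The two-signal agreement rule can be sketched as below. The URL path patterns come from the examples above; the badge label strings, the mapping tables, and the function name are assumptions that would need to be adjusted per portal.

```python
from urllib.parse import urlparse

# Signal one: URL path patterns (from the examples in the text).
URL_PATTERNS = {"/for-sale/": "sale", "/for-rent/": "rental", "/apartments/": "rental"}
# Signal two: badge text (assumed labels; portals word these differently).
BADGE_LABELS = {"for sale": "sale", "for rent": "rental"}


def classify_listing_type(url, badge_text):
    """Return (listing_type, destination); both signals must agree to reach 'fact'."""
    path = urlparse(url).path.lower()
    from_url = next((kind for marker, kind in URL_PATTERNS.items() if marker in path), None)
    from_badge = BADGE_LABELS.get((badge_text or "").strip().lower())
    if from_url is not None and from_url == from_badge:
        return from_url, "fact"
    # Missing or conflicting signals go to review instead of being guessed.
    return None, "review"


print(classify_listing_type("https://portal.example/for-sale/austin/2048812", "For Sale"))
print(classify_listing_type("https://portal.example/for-rent/austin/77", "For Sale"))
```

A missing badge is treated the same as a disagreement, so a selector that silently breaks after a portal redesign fills the review queue rather than mislabeling rows.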

Some portals serve mixed search results — a 'homes' search page that includes both for-sale and for-rent listings in the same result set. In this case, extract listing_type from each card individually using a per-card selector, not a page-level heuristic.

Rent-to-own and lease-option listings are genuinely ambiguous — they have both a purchase price and a monthly payment. Tag them as 'rent_to_own' in listing_type and exclude them from both pure-sale and pure-rental comp models unless your product explicitly supports them.

7. Pipeline and product metrics

Brokers and analysts care most about detection speed and price-cut alerts — optimize those two metrics before optimizing raw volume or coverage breadth. A pipeline that catches every price cut within 6 hours across 3 metros is more valuable than one that covers 20 metros with 48-hour latency.

New listing detection latency — hours from portal publish timestamp to first row in your snapshot table. Target under 24h for active metros; under 4h for premium broker tiers.
Days-on-market accuracy — spot-check a sample of closed listings against MLS ground truth or broker records. A 10% sample monthly is sufficient for most use cases.
Duplicate listing rate — percentage of canonical_ids with more than one active source portal. High rates indicate your merge key needs tuning.
Geo rejection rate — percentage of extracted rows that fail the point-in-polygon filter. Sudden spikes indicate a portal changed its URL structure and you're fetching out-of-market pages.
Extraction success rate — percentage of OmniScrape requests where all required CSS selectors return non-empty values. Track per portal and per selector key to catch DOM changes early.
Photo URL 404 rate — CDN links expire; track this weekly and stop storing photo URLs from portals where >30% expire within 7 days.
Cost per metro per month — credits consumed by portal × metro. Identifies which portals are worth the browser-rendering overhead vs cheaper sitemap-only approaches.
Price-cut alert delivery latency — time from price change detection to broker webhook delivery. Brokers rank this as the highest-value signal in most market conditions.
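As one example from the list above, extraction success rate and the per-selector breakdown can be computed from a batch of css_extracted maps. This is a sketch that assumes extracted values are strings; `REQUIRED_KEYS` and the function name are hypothetical choices.

```python
from collections import Counter

# Assumed set of selector keys that must be non-empty for a row to count as a success.
REQUIRED_KEYS = ("price", "beds", "baths", "sqft", "address", "status")


def extraction_stats(responses, required=REQUIRED_KEYS):
    """Return (success_rate, empty_count_by_key) for a batch of css_extracted maps."""
    empty_by_key = Counter()
    complete = 0
    for extracted in responses:
        missing = [key for key in required if not (extracted.get(key) or "").strip()]
        empty_by_key.update(missing)
        if not missing:
            complete += 1
    success_rate = complete / len(responses) if responses else 0.0
    return success_rate, dict(empty_by_key)


batch = [
    {"price": "$485,000", "beds": "3", "baths": "2.5", "sqft": "1,840",
     "address": "142 Oak St, Austin, TX 78701", "status": "Active"},
    {"price": "$610,000", "beds": "4", "baths": "3", "sqft": "",
     "address": "9 Pine Ct, Austin, TX 78704", "status": "Active"},
]

print(extraction_stats(batch))
```

Grouping the same calculation by portal shows which selector broke and where, which is the early DOM-change signal the metric is meant to provide.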

8. Historical price snapshots

Store list_price_usd on every snapshot even when the listing status stays 'active' and nothing else changes. Price cuts are the primary signal for buyer intent modeling, automated valuation model calibration, and broker alert products. A listing that drops from $499,000 to $485,000 after 21 days on market tells a very different story than one that holds price for 60 days.

Use scraped_at-partitioned append-only tables rather than update-in-place. This gives you free time-travel: 'what was the median list price in this zip code 90 days ago?' is a simple partition-filtered aggregate, not a slowly changing dimension join. In BigQuery or Redshift, monthly partitioning on scraped_at keeps query costs low even at hundreds of millions of rows.

Reconstruct the price history for any listing as: SELECT scraped_at, list_price_usd FROM snapshots WHERE canonical_id = ? ORDER BY scraped_at. Every row whose list_price_usd differs from the previous row's is a price event. Materialize these price events into a separate price_history table for fast broker dashboard queries.
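The row-over-row comparison can be sketched in a few lines. This assumes the snapshots for one canonical_id have already been fetched as dictionaries and that `scraped_at` values are ISO 8601 UTC strings in a single format, so they sort correctly as text; the output field names are hypothetical.

```python
def price_events(snapshots):
    """Return one event per change in list_price_usd, in scrape order."""
    events = []
    previous = None
    for snap in sorted(snapshots, key=lambda s: s["scraped_at"]):
        price = snap["list_price_usd"]
        if price is None:
            continue  # skip snapshots where the price failed to extract
        if previous is not None and price != previous:
            delta = price - previous
            events.append({
                "scraped_at": snap["scraped_at"],
                "old_price_usd": previous,
                "new_price_usd": price,
                "delta_usd": delta,
                "pct_change": round(100 * delta / previous, 2),
            })
        previous = price
    return events


history = [
    {"scraped_at": "2026-05-01T07:00:00Z", "list_price_usd": 499000},
    {"scraped_at": "2026-05-15T07:00:00Z", "list_price_usd": 499000},
    {"scraped_at": "2026-05-22T07:00:00Z", "list_price_usd": 485000},
]

print(price_events(history))
```

A negative `delta_usd` marks a price cut; the same rows can feed both the price_history table and the alert webhook.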

For sold/rented listings, the terminal snapshot (status = 'sold' or 'rented') is the most valuable row — it anchors your comp model with an actual transaction price rather than an ask price. Preserve terminal snapshots indefinitely even if you purge intermediate active-status rows after 12 months to manage storage costs.

9. JavaScript map search scraping

Map-based search UIs load listing pins and cards entirely client-side. The server returns a shell HTML page; the actual listing data arrives via XHR/fetch calls triggered by the map viewport. Standard HTTP scraping returns an empty result set — you need a headless browser that executes the page's JavaScript.

Use mode 'js_rendering' with js_wait_selector set to the CSS selector of the result cards container. Set js_wait_timeout to at least 12,000ms — map tiles and listing data often load in two sequential XHR rounds, and a 5s timeout will catch the first round but miss the second.

Extract listing hrefs from the rendered DOM using output_format 'css_extractor' with a selector targeting anchor tags inside result cards. Enqueue those URLs into your standard detail-page queue — don't try to parse listing data from the search result cards directly, as they typically show truncated fields.

Pagination on map search is viewport-based, not page-number-based. To cover a metro, divide the bounding box into a grid of overlapping tiles, fetch each tile's search URL, and deduplicate listing IDs across tiles. Tile size is a tradeoff: smaller tiles return fewer listings per request (more requests, more cost) but avoid the portal's per-search result cap (typically 200–500 listings).
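The viewport grid can be generated as in the sketch below. It assumes a simple latitude/longitude bounding box and pads each tile by a fraction of its own size; the function name and the 10% default overlap are illustrative, and how a tile maps to a portal's search URL is portal-specific and not shown.

```python
def tile_bounding_box(west, south, east, north, rows, cols, overlap=0.1):
    """Split a bounding box into rows x cols tiles padded by `overlap` of tile size.

    Each tile is (west, south, east, north), clamped to the outer box.
    """
    tile_w = (east - west) / cols
    tile_h = (north - south) / rows
    pad_x, pad_y = tile_w * overlap, tile_h * overlap
    tiles = []
    for r in range(rows):
        for c in range(cols):
            tiles.append((
                max(west, west + c * tile_w - pad_x),
                max(south, south + r * tile_h - pad_y),
                min(east, west + (c + 1) * tile_w + pad_x),
                min(north, south + (r + 1) * tile_h + pad_y),
            ))
    return tiles


tiles = tile_bounding_box(-98.0, 30.0, -97.0, 31.0, rows=2, cols=2)
for tile in tiles:
    print(tile)
```

Listings in the overlap strips appear in more than one tile, so collect the extracted listing URLs in a set before enqueuing them. If a tile returns a count at the portal's result cap, that tile is a candidate for further subdivision.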

For portals where sitemaps are complete and fresh, skip map search entirely — sitemap-sourced URLs are cheaper (HTTP-only in most cases), more stable across portal redesigns, and don't require viewport grid math. Use map search only when sitemaps lag new listings by more than your acceptable detection latency.

10. Legal and data governance

The technical ability to fetch a page does not determine whether doing so is permissible. Most major real estate portals have terms of service that prohibit automated scraping, data redistribution, or commercial use of listing data without a licensing agreement. Some of this data is also governed by MLS rules that carry contractual and legal weight independent of the portal's own terms.

OmniScrape provides technical access to publicly accessible web pages — it is not a legal opinion on permissible use. Your legal counsel must review the terms of each portal you intend to scrape, the jurisdiction you operate in, and the downstream use of the data (internal analytics vs. customer-facing product vs. data resale).

Practical governance steps for proptech teams: maintain a portal registry with ToS review date and approved-use scope per portal; implement data retention limits that match your legal review (e.g., purge raw HTML after 30 days, keep extracted fields per your data agreement); never store or serve portal photos without a CDN licensing agreement; and log every scrape request with timestamp, URL, and the business purpose that authorized it.

Licensed MLS data feeds (RESO Web API, RETS) are the standard path for US residential listing data in production broker and AVM products. HTML scraping is more common for rental portals, commercial listings, international markets, and supplementary data signals (price history, days-on-market cross-validation) where licensed feeds are unavailable or prohibitively expensive.

Frequently asked questions

Can I scrape Zillow, Realtor.com, or similar major portals?

Each portal's terms of service governs what's permitted. Most major US residential portals explicitly prohibit automated scraping and commercial use of listing data without a license. Many proptech products use licensed RESO/MLS feeds for US residential data and reserve HTML scraping for supplementary signals, rental portals, or international markets where licensed feeds don't exist. Get legal review before building a production pipeline on any major portal.

Why do I need residential proxies for real estate portals?

Major portals geo-gate results and aggressively block datacenter IP ranges. A datacenter IP requesting Austin listings may get empty map results, a CAPTCHA wall, or a geo-redirected response. Setting proxy: 'residential:us' in your OmniScrape request routes through IPs that match the market you're querying, which dramatically reduces empty-result rates on map search pages and reduces bot-detection escalations on detail pages.

How often should I refresh active listings?

For active listings in competitive metros, daily refreshes are the minimum for price-cut detection. If your broker product promises same-day alerts, you need 6–12 hour polling cycles on high-velocity markets. For sold or rented listings, stop polling as soon as the status flips — write the terminal snapshot and mark the listing inactive in your queue. Continuing to poll off-market listings wastes credits and adds noise to your active-inventory counts.

What if the listing price or availability only loads via XHR, not in the initial HTML?

Set mode to 'js_rendering' and use js_wait_selector targeting the price element — the headless browser will execute the page's JavaScript including the XHR calls and wait until the selector appears in the DOM before extracting. If the price exists only in a network response JSON and never renders into the DOM at all (rare but possible with some single-page apps), you'll need to intercept the XHR response. See JavaScript rendered pages for that pattern.

How do I validate that geocodes are accurate enough for comp models?

Use geocoder_confidence scores and reject rows below your threshold (0.85 is a reasonable starting point for street-level precision). Run a PostGIS ST_Within check against your target metro polygon — a geocode that places a property outside your licensed market boundary should be quarantined, not silently included. Spot-check 1–2% of geocoded rows monthly against a known-good address dataset for the metros where your model is most sensitive to location accuracy.

How do I handle the same listing appearing on multiple portals?

Build a canonical_id from geocoded coordinates rounded to 5 decimal places (~1m precision) combined with beds and sqft within a ±5% tolerance. This cross-portal merge key lets you pick the freshest or most complete record per property without relying on portal-specific listing IDs. Store all portal records and the canonical_id in your snapshot table — don't discard portal records, as different portals sometimes have different fields (one may have HOA fees, another may have school district data).

What's the right way to detect price cuts in my pipeline?

In your snapshot table, compare list_price_usd on the current scrape against the most recent prior snapshot for the same canonical_id. If current < prior, write a row to a price_events table with the delta and percentage change, then trigger your alert downstream. Materialize this as a streaming or micro-batch job rather than a daily batch if your broker SLA requires same-day alerts. Avoid updating the snapshot row in place — the append-only pattern is what makes time-travel queries and audit trails possible.

