OmniScrape

Web Scraping with R: httr2, rvest, and the OmniScrape API

R is the natural home for data once it arrives — ggplot2, dplyr, lme4, Stan. The challenge is getting structured data there without leaving RStudio to run a Python script or manually export CSVs. rvest bridges that gap by exposing HTML as a tree you can query with CSS selectors, producing output that flows directly into tibbles and dplyr pipelines. httr2, the modern replacement for the aging httr package, gives you a composable, pipe-friendly HTTP client with proper timeout handling, retry logic, and request body helpers.

The workflow breaks down fast when a government statistics portal, academic journal supplement, or financial data vendor adds Cloudflare protection or serves tables via JavaScript fetch calls. Rather than maintaining a Selenium or Playwright sidecar process, you can route those requests through the OmniScrape API and pipe the returned HTML straight into read_html() — your rvest selectors stay identical. For a scripting-language comparison, see web scraping with Python.

On this page

1. Install and load packages
2. Fetch a page with httr2
3. Extract tables and nodes with rvest
4. Clean and persist data as a tibble
5. When resp_body_string returns a challenge page
6. httr2 + OmniScrape for protected pages
7. Server-side CSS extraction for structured fields
8. JavaScript-rendered tables and SPAs
9. Reproducible research habits for scraped data
10. Handle API failures and HTTP errors in R
11. FAQ

1. Install and load packages

Three packages cover the full workflow: httr2 handles all HTTP concerns (request building, authentication headers, retries, timeouts), rvest handles HTML parsing and CSS-selector-based extraction, and jsonlite provides explicit JSON serialisation when you need fine-grained control over how lists are converted. All three are on CRAN and install without system dependencies on Linux, macOS, and Windows.

If you are working in a reproducible research context — Rmd, Quarto, or a packaged analysis — pin package versions with renv::snapshot() after installation so collaborators and CI reproduce the same environment.

R console
r
install.packages(c("httr2", "rvest", "jsonlite", "dplyr", "readr", "purrr"))

# Verify versions
packageVersion("httr2")   # >= 1.0.0 recommended
packageVersion("rvest")   # >= 1.0.3
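The version comments above can be turned into a check that runs at the top of a script. This is a minimal sketch: `check_pkg()` and the `required` vector are hypothetical names, and the minimum versions are simply the ones quoted in the install snippet.

```r
# Minimum versions taken from the comments in the install snippet
required <- c(httr2 = "1.0.0", rvest = "1.0.3")

# Returns "missing", "outdated", or "ok" for one package
check_pkg <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) return("missing")
  if (packageVersion(pkg) < min_version) "outdated" else "ok"
}

status <- vapply(names(required), \(p) check_pkg(p, required[[p]]), character(1))
print(status)
```

A script could stop early when any entry is not "ok", which gives collaborators a clearer message than a failure deep inside a pipeline.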

2. Fetch a page with httr2

httr2's pipe-based API lets you compose a request incrementally before executing it. req_user_agent() sets a descriptive agent string — many sites return different content or block requests with the default libcurl agent. req_timeout() prevents hung connections from stalling a batch job. req_perform() executes the request and returns a response object; resp_body_string() materialises the body as a character vector.

For multi-page crawls, wrap req_perform() in purrr::map() with req_throttle() to respect crawl delays. Check resp_status() before parsing — a 200 with a bot-challenge body is more dangerous than an explicit 403 because rvest will parse the challenge page silently.

fetch.R
r
library(httr2)

resp <- request("https://books.toscrape.com/catalogue/page-1.html") |>
  req_user_agent("RScraperBot/1.0 (+https://yourlab.example.com)") |>
  req_timeout(30) |>
  req_error(is_error = \(resp) FALSE) |>   # handle errors manually
  req_perform()

status <- resp_status(resp)
cat("HTTP status:", status, "\n")

if (status != 200L) {
  stop("Unexpected status: ", status)
}

html <- resp_body_string(resp)
cat("Fetched", nchar(html), "characters\n")
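For the multi-page case mentioned earlier, one possible shape is to build every request up front with a throttle and a retry policy, then perform them in a loop. This is a sketch only: `build_request()` and `crawl()` are hypothetical helpers, and the rate of one request per second is an assumed crawl delay, not a value the target site publishes.

```r
library(httr2)
library(purrr)

page_urls <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", 1:3)

# Build (but do not send) a polite request for one URL
build_request <- function(url) {
  request(url) |>
    req_user_agent("RScraperBot/1.0 (+https://yourlab.example.com)") |>
    req_timeout(30) |>
    req_throttle(rate = 1) |>                 # assumed limit: about 1 request per second
    req_retry(max_tries = 3) |>               # retries transient 429/503 responses
    req_error(is_error = \(resp) FALSE)       # inspect status codes manually
}

requests <- map(page_urls, build_request)

# Perform each request; non-200 responses become NULL
crawl <- function(reqs) {
  map(reqs, \(req) {
    resp <- req_perform(req)
    if (resp_status(resp) != 200L) return(NULL)
    resp_body_string(resp)
  })
}

# pages <- crawl(requests)   # uncomment to hit the network
```

Separating request construction from execution makes the crawl easy to inspect before any traffic is sent.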

3. Extract tables and nodes with rvest

read_html() accepts a character string of raw HTML and returns an xml_document object. html_elements() applies a CSS selector and returns a nodeset; html_element() returns the first match within a given context node. html_text2() is preferred over html_text() because it collapses whitespace the way a browser would, stripping leading and trailing space and collapsing internal runs.

html_table() works well on government statistical tables that use proper thead/tbody markup. It struggles with tables that use CSS for layout or that merge cells in non-standard ways — in those cases, walk the rows manually with html_elements('tr') and extract td/th individually. html_attr() retrieves attribute values such as href, data-*, or src.

parse.R
r
library(rvest)
library(purrr)

page <- read_html(html)

books <- page |>
  html_elements("article.product_pod") |>
  map(\(card) {
    tibble::tibble(   # list_rbind() needs data frames, not bare lists
      title    = card |> html_element("h3 a")      |> html_attr("title"),
      price    = card |> html_element(".price_color") |> html_text2(),
      rating   = card |> html_element("p.star-rating") |> html_attr("class"),
      in_stock = grepl(
        "In stock",
        card |> html_element(".instock.availability") |> html_text2()
      )
    )
  }) |>
  list_rbind()

print(head(books, 3))
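The two table strategies described above, html_table() for well-formed markup and a manual row walk for awkward layouts, can be compared on a small inline document. The HTML below is invented for illustration; only the rvest calls come from the text.

```r
library(rvest)

html_doc <- '<table id="rates">
  <thead><tr><th>Quarter</th><th>Rate</th></tr></thead>
  <tbody>
    <tr><td>Q3</td><td><a href="/q3">4.1%</a></td></tr>
    <tr><td>Q4</td><td><a href="/q4">3.9%</a></td></tr>
  </tbody>
</table>'

page <- read_html(html_doc)

# Strategy 1: let html_table() use the thead/tbody structure
tbl <- page |> html_element("table#rates") |> html_table()

# Strategy 2: walk the rows and pull each cell by position
rows <- page |> html_elements("table#rates tbody tr")
manual <- data.frame(
  quarter = rows |> html_element("td:nth-child(1)") |> html_text2(),
  rate    = rows |> html_element("td:nth-child(2)") |> html_text2()
)

# Attributes such as href are not captured by html_table()
links <- page |> html_elements("table#rates tbody a") |> html_attr("href")
```

The manual walk is more verbose but keeps working when a table mixes layout cells with data cells.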

4. Clean and persist data as a tibble

list_rbind() from purrr row-binds a list of data frames into a single tibble, so each record returned from map() must itself be a one-row tibble rather than a bare list. readr::parse_number() strips currency symbols and thousands separators in one call, which is more reliable than gsub() chains. Always add a scraped_at timestamp column before writing; when you re-run the scraper weeks later you need to know which rows came from which run.

Save raw HTML snapshots alongside the CSV. If your selector breaks because the site was redesigned, you can re-parse the archived HTML without re-scraping. Store files with ISO-8601 dates in the filename so that list.files() returns them in chronological order. Avoid saving only .RData: it is opaque to version control and diff tools.

tidy.R
r
library(dplyr)
library(readr)

books_df <- books |>
  mutate(
    # strip "£" and parse to numeric
    price_num  = parse_number(price),
    # extract star count from class string, e.g. "star-rating Three"
    stars      = case_when(
      grepl("One",   rating) ~ 1L,
      grepl("Two",   rating) ~ 2L,
      grepl("Three", rating) ~ 3L,
      grepl("Four",  rating) ~ 4L,
      grepl("Five",  rating) ~ 5L,
      .default = NA_integer_
    ),
    scraped_at = Sys.time()
  ) |>
  select(title, price_num, stars, in_stock, scraped_at)

date_tag <- format(Sys.Date(), "%Y-%m-%d")
dir.create("data", showWarnings = FALSE)   # write_csv() does not create directories
write_csv(books_df, paste0("data/books_", date_tag, ".csv"))
cat("Wrote", nrow(books_df), "rows\n")
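The raw-HTML snapshot habit can be sketched with base R alone. `snapshot_path()` and `save_snapshot()` are hypothetical helpers, and the slug rule (every run of non-alphanumeric characters becomes a hyphen) is just one reasonable convention.

```r
# Build a dated, sortable filename from a URL
snapshot_path <- function(url, dir = "data/raw", date = Sys.Date()) {
  slug <- gsub("^https?://", "", url)
  slug <- gsub("[^A-Za-z0-9]+", "-", slug)
  slug <- gsub("^-+|-+$", "", slug)
  file.path(dir, paste0(slug, "_", format(date, "%Y-%m-%d"), ".html"))
}

# Write the HTML string to disk and return the path invisibly
save_snapshot <- function(html, url, dir = "data/raw", date = Sys.Date()) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- snapshot_path(url, dir, date)
  writeLines(html, path, useBytes = TRUE)
  invisible(path)
}

# Demo against a temporary directory with a fixed date
demo_dir <- file.path(tempdir(), "raw")
path <- save_snapshot(
  "<html><body>demo</body></html>",
  "https://books.toscrape.com/catalogue/page-1.html",
  dir = demo_dir,
  date = as.Date("2024-11-01")
)
print(basename(path))
```

Because the date sits at the end of the slug in ISO-8601 form, snapshots of the same URL sort chronologically in a directory listing.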

5. When resp_body_string returns a challenge page

Central bank portals, academic journal supplements, vendor dashboards, and government procurement systems increasingly sit behind Cloudflare, Akamai, or custom CAPTCHA systems. The tell-tale signs in R: grepl("Checking your browser", html) returns TRUE, html_elements() finds zero nodes where you expect dozens, or the HTML contains a meta refresh to a /cdn-cgi/ path.

At that point, tweaking req_user_agent() or adding Accept-Language headers will not help — the challenge requires JavaScript execution and sometimes fingerprint-based proof-of-work. Route those requests through OmniScrape instead. The Cloudflare bypass guide explains what the service handles on your behalf. Your rvest parsing code does not change; you just swap the HTML source.

A simple guard before parsing:

detect_challenge.R
r
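# Detect challenge pages before attempting to parse
is_challenge <- function(html) {
  markers <- c("Checking your browser", "cf-browser-verification",
               "Enable JavaScript", "cdn-cgi/challenge")
  # grepl() only uses the first element of a pattern vector, so test each marker in turn
  any(vapply(markers, \(m) grepl(m, html, fixed = TRUE), logical(1)))
}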

if (is_challenge(html)) {
  message("Bot protection detected — routing through OmniScrape")
  # proceed to omniscrape section below
}
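String markers change as vendors update their challenge pages, so a structural check can complement them: confirm that the selector you rely on actually matches something. The helper below is a hypothetical sketch, and both HTML strings are invented stand-ins for a challenge page and a real listing.

```r
library(rvest)

# TRUE when the page contains at least min_nodes matches for selector
has_expected_nodes <- function(html, selector, min_nodes = 1L) {
  n <- length(html_elements(read_html(html), selector))
  n >= min_nodes
}

challenge_html <- "<html><head><title>Just a moment...</title></head>
  <body><p>Checking your browser</p></body></html>"
real_html <- "<html><body>
  <article class='product_pod'></article>
  <article class='product_pod'></article>
</body></html>"

blocked_ok <- has_expected_nodes(challenge_html, "article.product_pod")
listing_ok <- has_expected_nodes(real_html, "article.product_pod", min_nodes = 2L)
```

A zero-match result on a page that normally yields dozens of nodes is a strong hint that the body is not the content you asked for.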

6. httr2 + OmniScrape for protected pages

POST a JSON body with req_body_json(). Pass your API key in req_headers() — store it in .Renviron as OMNISCRAPE_KEY and read it with Sys.getenv() so it never appears in committed code or knitted output. Set req_timeout() to at least 120 seconds; js_rendering mode spins up a headless browser and the round-trip takes longer than a plain HTTP fetch.

The response body is parsed by resp_body_json() into an R list. Check body$success before accessing body$data$content — on failure the error details are in body$error. Pipe body$data$content directly into read_html(); all your existing rvest selectors work unchanged.

omniscrape.R
r
library(httr2)
library(rvest)

api_key <- Sys.getenv("OMNISCRAPE_KEY")
if (nchar(api_key) == 0L) stop(".Renviron missing OMNISCRAPE_KEY")

resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key, "Content-Type" = "application/json") |>
  req_body_json(list(
    url           = "https://protected-portal.gov/statistics/q4",
    mode          = "auto",
    output_format = "html",
    enable_solver = TRUE,
    proxy         = "residential:us"
  )) |>
  req_timeout(120) |>
  req_error(is_error = \(resp) FALSE) |>
  req_perform()

body <- resp_body_json(resp)

if (!isTRUE(body$success)) {
  stop("OmniScrape error: ", jsonlite::toJSON(body$error, auto_unbox = TRUE))
}

# HTML is in data$content — not data$html
html <- body$data$content
cat(
  "Method:", body$metadata$method_used,
  "| Solver:", body$metadata$solver_used,
  "| Cost: $", body$billing$charged,
  "| Balance: $", body$billing$balance_after, "\n"
)

page <- read_html(html)
rate <- page |> html_element(".headline-rate") |> html_text2()
cat("Rate:", rate, "\n")
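The success check can be factored into a small function and exercised offline against mock bodies. This sketch assumes only the fields named above (success, data$content, error); the mock JSON is invented and `extract_content()` is a hypothetical helper, not part of any client library.

```r
library(jsonlite)

# Return the HTML string, or stop with the API's error details
extract_content <- function(body) {
  if (!isTRUE(body$success)) {
    stop("OmniScrape error: ", toJSON(body$error, auto_unbox = TRUE), call. = FALSE)
  }
  content <- body$data$content
  if (!is.character(content) || length(content) != 1L || !nzchar(content)) {
    stop("Response reported success but data$content is empty", call. = FALSE)
  }
  content
}

# Mock bodies shaped like the list resp_body_json() would return
ok_body <- fromJSON(
  r"({"success": true, "data": {"content": "<p class='headline-rate'>3.9%</p>"}})",
  simplifyVector = FALSE
)
bad_body <- fromJSON(
  r"({"success": false, "error": {"message": "target unreachable"}})",
  simplifyVector = FALSE
)

html_ok <- extract_content(ok_body)
err_msg <- tryCatch(extract_content(bad_body), error = conditionMessage)
```

Testing the failure branch with a mock avoids spending API balance just to confirm that error handling works.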

7. Server-side CSS extraction for structured fields

When you only need a handful of fields — a price, a headline figure, a stock status — the css_extractor output format lets OmniScrape apply your CSS selectors server-side and return a named list in body$data$css_extracted. You skip read_html() and html_elements() entirely, which simplifies the R code and reduces the payload size.

This is particularly useful in Shiny applications where you want to display a few KPIs fetched live: less data transferred, less parsing overhead, and the result maps directly to a tibble with as_tibble() or to individual reactive values.

css_extractor.R
r
library(dplyr)   # as_tibble() and mutate(); httr2 and api_key come from omniscrape.R above

resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key) |>
  req_body_json(list(
    url           = "https://protected-shop.com/product/42",
    mode          = "auto",
    output_format = "css_extractor",
    enable_solver = TRUE,
    css_selectors = list(
      title       = "h1.product-title",
      price       = "span.price-now",
      stock       = "p.availability",
      rating      = "span.review-score"
    )
  )) |>
  req_timeout(120) |>   # mode "auto" can escalate to a headless browser
  req_perform()

body <- resp_body_json(resp)

if (!isTRUE(body$success)) {
  stop("Extraction failed: ", jsonlite::toJSON(body$error, auto_unbox = TRUE))
}

fields <- body$data$css_extracted
# fields is a named list; convert to a one-row tibble
result <- as_tibble(fields) |>
  mutate(
    price_num  = readr::parse_number(price),
    scraped_at = Sys.time()
  )

print(result)
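If a selector matches nothing, the corresponding field may come back as JSON null, which becomes NULL in the R list and cannot be turned into a column directly. The sketch below assumes that behaviour and that each field holds a single string; `fields_to_row()` is a hypothetical helper and the sample values are invented.

```r
library(readr)

# Convert a named list of extracted strings into a one-row data frame,
# replacing NULL (no match) with NA
fields_to_row <- function(fields) {
  cleaned <- lapply(fields, function(x) {
    if (is.null(x)) NA_character_ else as.character(x)[1]
  })
  as.data.frame(cleaned, stringsAsFactors = FALSE)
}

fields <- list(
  title  = "Widget",
  price  = "$1,299.00",
  stock  = NULL,          # selector matched nothing
  rating = "4.6"
)

row <- fields_to_row(fields)
row$price_num  <- parse_number(row$price)
row$scraped_at <- Sys.time()
```

Keeping missing fields as NA rather than dropping them means every scrape produces the same columns, which simplifies binding runs together later.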

8. JavaScript-rendered tables and SPAs

rvest operates on the HTML string returned by the server — it has no JavaScript engine. Dashboards that populate tables by dispatching fetch() or XMLHttpRequest calls after page load will appear empty when parsed with html_table(). The symptom is html_elements('table') returning a nodeset of length zero on a page that visually shows a full data grid in the browser.

Use mode js_rendering with js_wait_selector set to a CSS selector that only appears once the target data is in the DOM. js_wait_timeout is in milliseconds; 15 000 (15 s) is a reasonable starting point for dashboards that fetch data from a slow API. See scraping JavaScript-rendered pages for a deeper treatment of wait strategies.

js_rendering.R
r
resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key) |>
  req_body_json(list(
    url              = "https://spa-dashboard.com/metrics",
    mode             = "js_rendering",
    output_format    = "html",
    js_wait_selector = "table.data-grid tbody tr",
    js_wait_timeout  = 15000
  )) |>
  req_timeout(120) |>
  req_perform()

body <- resp_body_json(resp)
if (!isTRUE(body$success)) stop("JS render failed")

# body$data$content holds the post-JS HTML
page  <- read_html(body$data$content)
tbl   <- page |> html_element("table.data-grid") |> html_table()
cat("Rows fetched:", nrow(tbl), "\n")
print(head(tbl))
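When a batch mixes static and JavaScript-heavy targets, the request body can be built by one function that switches mode only when a wait selector is supplied. This is a sketch using the parameter names shown above; `build_scrape_body()` is a hypothetical helper.

```r
# Build the JSON body as an R list; js_rendering is used only when
# a wait selector is given
build_scrape_body <- function(url, wait_selector = NULL, wait_timeout_ms = 15000) {
  body <- list(url = url, mode = "auto", output_format = "html")
  if (!is.null(wait_selector)) {
    body$mode             <- "js_rendering"
    body$js_wait_selector <- wait_selector
    body$js_wait_timeout  <- wait_timeout_ms   # milliseconds
  }
  body
}

static_body <- build_scrape_body("https://site-a.com/data")
spa_body    <- build_scrape_body(
  "https://spa-dashboard.com/metrics",
  wait_selector = "table.data-grid tbody tr"
)
```

The returned list can be passed straight to req_body_json().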

9. Reproducible research habits for scraped data

Academic and policy workflows require audit trails that survive journal peer review, lab handovers, and re-analysis years later. Scraped data is inherently volatile — sites change structure, disappear, or add access controls. Build defensively from the start rather than retrofitting reproducibility after a deadline.

Key practices to adopt from the first script:

  • Store OMNISCRAPE_KEY in ~/.Renviron, never in .R scripts or knitted Rmd/Quarto output — use usethis::edit_r_environ() to open the file safely
  • Save raw HTML to disk immediately after fetching, with an ISO-8601 timestamp and the target URL's slug in the filename (e.g. data/raw/portal-gov-q4_2024-11-01.html)
  • Log body$metadata$method_used, body$metadata$solver_used, and body$billing$charged in a structured scrape_log.csv alongside the data — useful for cost tracking and debugging
  • Pin package versions with renv::snapshot() and commit renv.lock to version control so collaborators and CI reproduce the identical environment
  • Use Rscript + cron (Linux/macOS) or Task Scheduler (Windows) for production scheduled scrapes rather than manual click-run — document the schedule in a README
  • Write a data provenance section in your Rmd/Quarto document that records the scrape date range, source URLs, and OmniScrape mode used
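The scrape log from the list above can be kept with base R. This is a sketch: `log_scrape()` is a hypothetical helper, the column set is one possible layout, and the mock body only mimics the metadata and billing fields named in this guide.

```r
# Append one row per scrape to a CSV log, writing the header only once
log_scrape <- function(url, body, log_file = "scrape_log.csv") {
  null_to_na <- function(x) if (is.null(x)) NA else x
  row <- data.frame(
    scraped_at  = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z"),
    url         = url,
    method_used = null_to_na(body$metadata$method_used),
    solver_used = null_to_na(body$metadata$solver_used),
    charged     = null_to_na(body$billing$charged),
    stringsAsFactors = FALSE
  )
  new_file <- !file.exists(log_file)
  write.table(row, log_file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file, qmethod = "double")
  invisible(row)
}

# Demo with an invented response body and a temporary log file
mock_body <- list(
  metadata = list(method_used = "js_rendering", solver_used = TRUE),
  billing  = list(charged = 0.002)
)
log_file <- tempfile(fileext = ".csv")
log_scrape("https://site-a.com/data", mock_body, log_file)
log_scrape("https://site-b.com/data", mock_body, log_file)
log_df <- read.csv(log_file)
```

A plain CSV log diffs cleanly in version control, which fits the audit-trail goal better than a binary format.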

10. Handle API failures and HTTP errors in R

Production scrapers need explicit error handling at two layers: the HTTP response status from httr2, and the body$success flag from OmniScrape. Use purrr::safely() or purrr::possibly() to wrap scrape calls in batch jobs so one failed URL does not abort the entire run.

  • HTTP 401 — API key missing or malformed; fix .Renviron, call usethis::edit_r_environ(), stop the pipeline immediately
  • HTTP 402 — account balance exhausted; pause the scheduled job, notify the account holder, do not retry automatically
  • HTTP 429 — rate limit exceeded; implement exponential backoff with Sys.sleep(2^attempt) inside a tryCatch loop, cap at 3–4 attempts
  • HTTP 502 / 503 — transient upstream error; retry up to 3 times with a short delay, log each attempt
  • body$success == FALSE — the scrape completed but the target was unreachable or returned an unexpected response; log the URL and body$error, continue the batch with purrr::safely()
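  • Garbled non-ASCII text in the output: readr::write_csv() always writes UTF-8 and has no locale argument, so the usual cause is a reader (Excel on Windows, for example) assuming a legacy code page, or source HTML served in a legacy encoding. Use write_excel_csv() when the file will be opened in Excel, and pass encoding to read_html() when the page declares the wrong charset; this is rarely an rvest bug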
safe_batch.R
r
library(httr2)
library(purrr)

safe_scrape <- safely(\(target_url) {
  resp <- request("https://api.omniscrape.io/v1/scrape") |>
    req_method("POST") |>
    req_headers("X-API-Key" = api_key) |>
    req_body_json(list(
      url           = target_url,
      mode          = "auto",
      output_format = "html",
      enable_solver = TRUE
    )) |>
    req_timeout(120) |>
    req_error(is_error = \(r) FALSE) |>
    req_perform()

  body <- resp_body_json(resp)
  if (!isTRUE(body$success)) stop("API error: ", body$error$message)
  body$data$content
})

urls    <- c("https://site-a.com/data", "https://site-b.com/data")
results <- map(urls, safe_scrape)

successes <- keep(results, \(r) is.null(r$error))
failures  <- keep(results, \(r) !is.null(r$error))

cat("Succeeded:", length(successes), "| Failed:", length(failures), "\n")
walk(failures, \(r) message("Error: ", r$error$message))
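The status-code policy in the list above (stop on 401/402, back off on 429/502/503) can be sketched as a pure function plus a retry loop. Both helpers are hypothetical, and the `fetch` argument stands in for whatever function performs the request and returns a status code; the demo uses a fake fetcher and a recording sleep so it runs instantly and offline.

```r
# Map an HTTP status to an action, following the list above
classify_status <- function(status) {
  if (status %in% c(401L, 402L)) {
    "stop"
  } else if (status %in% c(429L, 502L, 503L)) {
    "retry"
  } else if (status >= 200L && status < 300L) {
    "ok"
  } else {
    "fail"
  }
}

# Call fetch() until it stops returning a retryable status,
# sleeping base_delay^attempt seconds between attempts
with_backoff <- function(fetch, max_attempts = 4L, base_delay = 2, sleep = Sys.sleep) {
  for (attempt in seq_len(max_attempts)) {
    status <- fetch()
    action <- classify_status(status)
    if (action != "retry" || attempt == max_attempts) {
      return(list(status = status, action = action, attempts = attempt))
    }
    sleep(base_delay^attempt)
  }
}

# Demo: two transient errors followed by success
responses <- c(429L, 503L, 200L)
i <- 0L
waits <- numeric()
fake_fetch <- function() {
  i <<- i + 1L
  responses[i]
}
result <- with_backoff(fake_fetch, sleep = function(s) waits <<- c(waits, s))
```

httr2's req_retry() covers the same ground for many cases; a hand-written loop is mainly useful when the decision depends on the response body as well as the status.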

Frequently asked questions

Should I use httr2 or the older httr package?

Use httr2 for all new projects. httr is in maintenance mode — it receives security fixes but no new features. httr2 has a cleaner pipe-based API, built-in retry and throttle helpers (req_retry(), req_throttle()), proper OAuth2 support, and better error handling via req_error(). Migration from httr is straightforward: request() replaces GET()/POST(), resp_body_string() replaces content(..., as='text').

When should I use rvest versus xml2 directly?

rvest is a wrapper around xml2 optimised for HTML scraping and is the right default. Its html_elements() and html_element() accept either a CSS selector or an xpath argument, so needing XPath is not by itself a reason to drop down to xml2. Use xml2 directly when you are processing strict XML (RSS feeds, Atom, SOAP responses) or when you need to modify and re-serialise a document. For HTML scraping, rvest's html_elements(), html_text2(), and html_table() cover the vast majority of cases.
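As a small illustration, rvest's html_elements() takes an xpath argument alongside CSS selectors, which helps with queries CSS cannot express, such as selecting by text content. The HTML here is invented for the example.

```r
library(rvest)

page <- read_html(
  "<ul><li class='a'>one</li><li>two</li><li class='a'>three</li></ul>"
)

# The same nodes selected two ways
by_css   <- page |> html_elements("li.a") |> html_text2()
by_xpath <- page |> html_elements(xpath = "//li[@class='a']") |> html_text2()

# Something CSS cannot express: siblings after the item whose text is "two"
after_two <- page |>
  html_elements(xpath = "//li[text()='two']/following-sibling::li") |>
  html_text2()
```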

Can I call OmniScrape from a live Shiny application?

Technically yes, but it is rarely the right architecture. OmniScrape requests take 2–30 seconds depending on the mode and solver; that latency will block a reactive and frustrate users. The better pattern is to pre-scrape on a schedule (cron + Rscript), write results to a parquet file or database, and have Shiny read from that cache. Reserve live OmniScrape calls for on-demand refresh buttons where the user explicitly accepts a wait.

How do I scrape multiple pages in parallel without getting blocked?

Use future + furrr for parallelism and httr2's req_throttle() to enforce a minimum delay between requests to the same host. For sites behind bot protection, parallel requests through OmniScrape are safe because the API manages IP rotation and session handling — you can fan out with furrr::future_map() without worrying about triggering rate limits on your own IP. Keep concurrency modest (4–8 workers) to avoid exhausting your API balance faster than expected.

How do I handle non-ASCII characters and encoding issues?

rvest reads encoding from the HTML meta charset declaration and handles UTF-8 correctly in most cases. Problems usually appear when the data is written or read back. readr::write_csv() always writes UTF-8 and has no locale argument, but base write.csv() uses the native encoding, which on older Windows R installations is often a legacy code page; pass fileEncoding = 'UTF-8' there. Excel may also misread a UTF-8 CSV that lacks a byte-order mark, in which case readr::write_excel_csv() adds one. If source HTML is in a legacy encoding (ISO-8859-1, Windows-1252), read_html(html, encoding = 'latin1') forces the correct interpretation.
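A quick round trip shows the readr behaviour. The data frame is invented, and Unicode escapes are used so the script itself stays ASCII-only.

```r
library(readr)

cities <- data.frame(
  city = c("Z\u00fcrich", "S\u00e3o Paulo"),
  stringsAsFactors = FALSE
)

# write_csv() writes UTF-8; reading it back recovers the same strings
csv_path <- tempfile(fileext = ".csv")
write_csv(cities, csv_path)
round_trip <- read_csv(csv_path, show_col_types = FALSE)

# write_excel_csv() prepends a UTF-8 byte-order mark for Excel
excel_path <- tempfile(fileext = ".csv")
write_excel_csv(cities, excel_path)
first_bytes <- readBin(excel_path, what = "raw", n = 3)
print(first_bytes)
```

If the round trip succeeds in R but the file looks garbled elsewhere, the problem lies in how the other tool guesses the encoding, not in the scrape.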

What is the difference between mode auto and mode js_rendering?

mode auto attempts a fast HTTP fetch first and escalates to a headless browser automatically if the response looks like a bot challenge or if the content is clearly incomplete. It is the right default for most targets. mode js_rendering always uses a headless browser, which costs more credits and takes longer but guarantees JavaScript execution — use it when you know the target is a SPA or dashboard that renders content entirely client-side and auto's escalation heuristic is not triggering reliably.

How do I integrate OmniScrape scraping into an R package or research compendium?

Store the API key in .Renviron and read it with Sys.getenv() inside your package functions; never hardcode it. Document the environment variable in your package README and DESCRIPTION. For a research compendium, add a data-raw/ directory containing the scraping scripts, and use targets (which supersedes the older drake package) to define a pipeline where scraping is one step and analysis is downstream. This makes the data provenance explicit and allows collaborators to reproduce the full pipeline by running targets::tar_make().

Related guides

  • Web Scraping with Python
  • How to Bypass Cloudflare When Web Scraping
  • Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs
  • Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

© 2026 OmniScrape. All rights reserved.
