1. Install and load packages
Three packages cover the core workflow: httr2 handles all HTTP concerns (request building, authentication headers, retries, timeouts), rvest handles HTML parsing and CSS-selector-based extraction, and jsonlite provides explicit JSON serialisation when you need fine-grained control over how lists are converted. The install command below also pulls in dplyr, readr, and purrr, which later steps use for cleaning and iteration. All six are on CRAN. Binary installs on macOS and Windows need no system libraries; source installs on Linux need the libcurl, OpenSSL, and libxml2 development headers.
If you are working in a reproducible research context — Rmd, Quarto, or a packaged analysis — pin package versions with renv::snapshot() after installation so collaborators and CI reproduce the same environment.
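install.packages(c("httr2", "rvest", "jsonlite", "dplyr", "readr", "purrr"))
# Verify versions
packageVersion("httr2") # >= 1.0.0 recommended
packageVersion("rvest") # >= 1.0.3
packageVersion("purrr") # >= 1.0.0 for list_rbind()
packageVersion("dplyr") # >= 1.1.0 for case_when(.default = )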
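Before running the later steps, it can help to confirm that every package is actually installed. The following is a minimal base-R sketch; `missing_packages()` is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("httr2", "rvest", "jsonlite", "dplyr", "readr", "purrr")
not_installed <- missing_packages(needed)

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}
```

Running this at the top of a script gives collaborators a readable message instead of a failed library() call halfway through a scrape.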
2. Fetch a page with httr2
httr2's pipe-based API lets you compose a request incrementally before executing it. req_user_agent() sets a descriptive agent string; many sites return different content or block requests that carry the default agent. req_timeout() prevents hung connections from stalling a batch job. req_perform() executes the request and returns a response object; resp_body_string() materialises the body as a single character string.
For multi-page crawls, add req_throttle() to each request and map over the page URLs with purrr::map() so the crawl respects the site's delay. Check resp_status() before parsing, then inspect the body as well: a 200 with a bot-challenge body is more dangerous than an explicit 403 because rvest will parse the challenge page silently.
library(httr2)
resp <- request("https://books.toscrape.com/catalogue/page-1.html") |>
  req_user_agent("RScraperBot/1.0 (+https://yourlab.example.com)") |>
  req_timeout(30) |>
  req_error(is_error = \(resp) FALSE) |> # handle errors manually
  req_perform()
status <- resp_status(resp)
cat("HTTP status:", status, "\n")
if (status != 200L) {
  stop("Unexpected status: ", status)
}
html <- resp_body_string(resp)
cat("Fetched", nchar(html), "characters\n")
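The multi-page pattern mentioned above could look roughly like this. It is a sketch under assumptions: the three page URLs follow the numbering of the example site, the throttle rate of 30 requests per minute is an arbitrary illustrative value, and `build_request()` and `fetch_all()` are hypothetical helper names. The requests are only built here; nothing is sent until `fetch_all()` is called.

```r
library(httr2)
library(purrr)

# One URL per catalogue page (page numbering assumed from the example site).
page_urls <- sprintf("https://books.toscrape.com/catalogue/page-%d.html", 1:3)

# Hypothetical helper: a request with throttling and retries attached.
build_request <- function(url) {
  request(url) |>
    req_user_agent("RScraperBot/1.0 (+https://yourlab.example.com)") |>
    req_timeout(30) |>
    req_throttle(rate = 30 / 60) |>  # roughly 30 requests per minute per host
    req_retry(max_tries = 3)         # retry transient failures up to 3 times
}

reqs <- map(page_urls, build_request)

# Hypothetical helper: perform the requests in order and return the bodies.
fetch_all <- function(reqs) {
  map(reqs, \(req) req |> req_perform() |> resp_body_string())
}

# pages <- fetch_all(reqs)  # uncomment to run the crawl
```

Attaching the throttle to the request rather than sleeping inside the loop keeps the delay logic in one place, and httr2 applies it per host.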
3. Extract tables and nodes with rvest
read_html() accepts a character string of raw HTML and returns an xml_document object. html_elements() applies a CSS selector and returns a nodeset; html_element() returns the first match within a given context node. html_text2() is preferred over html_text() because it collapses whitespace the way a browser would, stripping leading and trailing space and collapsing internal runs.
html_table() works well on government statistical tables that use proper thead/tbody markup. It struggles with tables that use CSS for layout or that merge cells in non-standard ways — in those cases, walk the rows manually with html_elements('tr') and extract td/th individually. html_attr() retrieves attribute values such as href, data-*, or src.
library(rvest)
library(purrr)
page <- read_html(html)
books <- page |>
  html_elements("article.product_pod") |>
  map(\(card) {
    # build a one-row tibble per card: list_rbind() only binds data frames
    tibble::tibble(
      title = card |> html_element("h3 a") |> html_attr("title"),
      price = card |> html_element(".price_color") |> html_text2(),
      rating = card |> html_element("p.star-rating") |> html_attr("class"),
      in_stock = grepl(
        "In stock",
        card |> html_element(".instock.availability") |> html_text2()
      )
    )
  }) |>
  list_rbind()
print(head(books, 3))
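To make the html_table(), manual row walking, and html_attr() options concrete, here is a small offline sketch. The inline HTML, the `#rates` table, and the `a.dl` link are invented for illustration; only the rvest functions are real.

```r
library(rvest)

# Invented fragment with proper thead/tbody markup and one download link.
page <- minimal_html('
  <table id="rates">
    <thead><tr><th>Quarter</th><th>Rate</th></tr></thead>
    <tbody>
      <tr><td>Q1</td><td>4.25</td></tr>
      <tr><td>Q2</td><td>4.50</td></tr>
    </tbody>
  </table>
  <a class="dl" href="/files/q2.csv" data-format="csv">Download</a>
')

# Well-formed table: html_table() returns a tibble with converted column types.
tbl <- page |> html_element("#rates") |> html_table()

# Fallback for awkward tables: walk the rows and pull the cells yourself.
rows <- page |> html_elements("#rates tbody tr")
manual <- lapply(rows, \(tr) tr |> html_elements("td, th") |> html_text2())

# Attribute lookup on a single node.
link <- page |> html_element("a.dl") |> html_attr("href")
```

The manual route returns one character vector per row, which you can reshape however the irregular table requires.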
4. Clean and persist data as a tibble
list_rbind() from purrr row-binds the list of per-card records into one tibble without extra dependencies. readr::parse_number() strips currency symbols and thousands separators in one call, which is more reliable than gsub() chains. Always add a scraped_at timestamp column before writing; when you re-run the scraper weeks later you need to know which rows came from which run.
Save raw HTML snapshots alongside the CSV. If your selector breaks because the site redesigned, you can re-parse the archived HTML without re-scraping. Store files with ISO-8601 dates in the filename so list.files() and file.info() sort chronologically. Avoid saving only .RData; it is opaque to version control and diff tools.
library(dplyr)
library(readr)
books_df <- books |>
  mutate(
    # strip "£" and parse to numeric
    price_num = parse_number(price),
    # extract star count from class string, e.g. "star-rating Three"
    stars = case_when(
      grepl("One", rating) ~ 1L,
      grepl("Two", rating) ~ 2L,
      grepl("Three", rating) ~ 3L,
      grepl("Four", rating) ~ 4L,
      grepl("Five", rating) ~ 5L,
      .default = NA_integer_
    ),
    scraped_at = Sys.time()
  ) |>
  select(title, price_num, stars, in_stock, scraped_at)
# write_csv() does not create missing directories
dir.create("data", showWarnings = FALSE)
date_tag <- format(Sys.Date(), "%Y-%m-%d")
write_csv(books_df, paste0("data/books_", date_tag, ".csv"))
cat("Wrote", nrow(books_df), "rows\n")
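The snapshot advice above can be sketched in base R as follows. `save_snapshot()` is a hypothetical helper, and the slug and dates are made-up values; the demo writes to a temporary directory so it leaves the project tree untouched.

```r
# Hypothetical helper: write raw HTML to <dir>/<slug>_<YYYY-MM-DD>.html
# and return the path.
save_snapshot <- function(html, slug, dir = "data/raw", date = Sys.Date()) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, paste0(slug, "_", format(date, "%Y-%m-%d"), ".html"))
  writeLines(html, path, useBytes = TRUE)
  path
}

# Demo with two fake runs in a throwaway directory.
demo_dir <- tempfile("raw_")
p1 <- save_snapshot("<html><body>run 1</body></html>", "books-page-1",
                    dir = demo_dir, date = as.Date("2024-11-01"))
p2 <- save_snapshot("<html><body>run 2</body></html>", "books-page-1",
                    dir = demo_dir, date = as.Date("2024-12-01"))

# ISO-8601 dates make the alphabetical listing chronological.
snapshots <- list.files(demo_dir)
```

In a real run you would call `save_snapshot(html, "books-page-1")` right after resp_body_string(), before any parsing.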
5. When resp_body_string returns a challenge page
Central bank portals, academic journal supplements, vendor dashboards, and government procurement systems increasingly sit behind Cloudflare, Akamai, or custom CAPTCHA systems. The tell-tale signs in R: grepl("Checking your browser", html) returns TRUE, html_elements() finds zero nodes where you expect dozens, or the HTML contains a meta refresh to a /cdn-cgi/ path.
At that point, tweaking req_user_agent() or adding Accept-Language headers will not help — the challenge requires JavaScript execution and sometimes fingerprint-based proof-of-work. Route those requests through OmniScrape instead. The Cloudflare bypass guide explains what the service handles on your behalf. Your rvest parsing code does not change; you just swap the HTML source.
A simple guard before parsing:
# Detect challenge pages before attempting to parse
is_challenge <- function(html) {
  markers <- c("Checking your browser", "cf-browser-verification",
               "Enable JavaScript", "cdn-cgi/challenge")
  # grepl() uses only the first element of a pattern vector,
  # so test each marker separately and combine the results
  any(vapply(markers, grepl, logical(1), x = html, fixed = TRUE))
}
if (is_challenge(html)) {
  message("Bot protection detected: routing through OmniScrape")
  # proceed to omniscrape section below
}
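String markers catch only the challenges you already know about. The other two signs listed above, missing nodes and a meta refresh to a /cdn-cgi/ path, can be checked structurally. This is a sketch: `looks_blocked()` is a hypothetical helper, and both HTML strings are invented test fixtures.

```r
library(rvest)

# Hypothetical helper: TRUE when the page has fewer matches for `selector`
# than expected, or carries a meta refresh pointing at a /cdn-cgi/ path.
looks_blocked <- function(html, selector, min_nodes = 1) {
  page <- read_html(html)
  n_found <- length(html_elements(page, selector))
  refresh <- page |>
    html_elements("meta[http-equiv='refresh']") |>
    html_attr("content")
  n_found < min_nodes || any(grepl("/cdn-cgi/", refresh, fixed = TRUE))
}

# Invented fixtures: a normal listing and a challenge interstitial.
real_page <- paste0(
  "<html><body>",
  "<article class='product_pod'>A</article>",
  "<article class='product_pod'>B</article>",
  "</body></html>"
)
challenge_page <- paste0(
  "<html><head>",
  "<meta http-equiv='refresh' content='5; url=/cdn-cgi/challenge'>",
  "</head><body>Checking your browser</body></html>"
)

real_blocked <- looks_blocked(real_page, "article.product_pod", min_nodes = 2)
challenge_blocked <- looks_blocked(challenge_page, "article.product_pod", min_nodes = 2)
```

Setting `min_nodes` to the count you normally expect on a listing page turns a silent empty result into an explicit signal.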
6. httr2 + OmniScrape for protected pages
POST a JSON body with req_body_json(). Pass your API key in req_headers() — store it in .Renviron as OMNISCRAPE_KEY and read it with Sys.getenv() so it never appears in committed code or knitted output. Set req_timeout() to at least 120 seconds; js_rendering mode spins up a headless browser and the round-trip takes longer than a plain HTTP fetch.
The response body is parsed by resp_body_json() into an R list. Check body$success before accessing body$data$content — on failure the error details are in body$error. Pipe body$data$content directly into read_html(); all your existing rvest selectors work unchanged.
library(httr2)
library(rvest)
api_key <- Sys.getenv("OMNISCRAPE_KEY")
if (nchar(api_key) == 0L) stop(".Renviron missing OMNISCRAPE_KEY")
resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key, "Content-Type" = "application/json") |>
  req_body_json(list(
    url = "https://protected-portal.gov/statistics/q4",
    mode = "auto",
    output_format = "html",
    enable_solver = TRUE,
    proxy = "residential:us"
  )) |>
  req_timeout(120) |>
  req_error(is_error = \(resp) FALSE) |>
  req_perform()
body <- resp_body_json(resp)
if (!isTRUE(body$success)) {
  stop("OmniScrape error: ", jsonlite::toJSON(body$error, auto_unbox = TRUE))
}
# HTML is in data$content, not data$html
html <- body$data$content
cat(
  "Method:", body$metadata$method_used,
  "| Solver:", body$metadata$solver_used,
  "| Cost: $", body$billing$charged,
  "| Balance: $", body$billing$balance_after, "\n"
)
page <- read_html(html)
rate <- page |> html_element(".headline-rate") |> html_text2()
cat("Rate:", rate, "\n")
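The success check can be exercised without network access or an API key by feeding a helper hand-built lists. This is a sketch under assumptions: `extract_content()` is a hypothetical helper, and the two mock bodies only imitate the fields named in this guide (success, data$content, error$message); they are not captured API responses.

```r
# Hypothetical helper: return the HTML from a parsed response body,
# or stop with whatever error detail the body carries.
extract_content <- function(body) {
  if (!isTRUE(body$success)) {
    detail <- if (is.null(body$error$message)) {
      "no error details"
    } else {
      body$error$message
    }
    stop("OmniScrape error: ", detail, call. = FALSE)
  }
  body$data$content
}

# Mock bodies shaped like the fields described above (invented values).
ok_body <- list(
  success = TRUE,
  data = list(content = "<p class='headline-rate'>4.5%</p>")
)
bad_body <- list(
  success = FALSE,
  error = list(message = "target unreachable")
)

html_ok <- extract_content(ok_body)
err_msg <- tryCatch(extract_content(bad_body), error = conditionMessage)
```

Keeping the check in one function means every call site handles failure the same way, and the mocks double as unit-test fixtures.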
7. Server-side CSS extraction for structured fields
When you only need a handful of fields — a price, a headline figure, a stock status — the css_extractor output format lets OmniScrape apply your CSS selectors server-side and return a named list in body$data$css_extracted. You skip read_html() and html_elements() entirely, which simplifies the R code and reduces the payload size.
This is particularly useful in Shiny applications where you want to display a few KPIs fetched live: less data transferred, less parsing overhead, and the result maps directly to a tibble with as_tibble() or to individual reactive values.
library(dplyr) # as_tibble() and mutate()
resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key) |>
  req_body_json(list(
    url = "https://protected-shop.com/product/42",
    mode = "auto",
    output_format = "css_extractor",
    enable_solver = TRUE,
    css_selectors = list(
      title = "h1.product-title",
      price = "span.price-now",
      stock = "p.availability",
      rating = "span.review-score"
    )
  )) |>
  req_timeout(60) |>
  req_perform()
body <- resp_body_json(resp)
if (!isTRUE(body$success)) {
  stop("Extraction failed: ", jsonlite::toJSON(body$error, auto_unbox = TRUE))
}
fields <- body$data$css_extracted
# fields is a named list; convert to a one-row tibble
result <- as_tibble(fields) |>
  mutate(
    price_num = readr::parse_number(price),
    scraped_at = Sys.time()
  )
print(result)
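One practical wrinkle when converting a named list to a tibble: a NULL element (which is how resp_body_json() represents a JSON null) cannot become a column. Whether the API actually returns null for a selector that matched nothing is an assumption here; the sketch below simply shows a defensive conversion, with `fields_to_row()` as a hypothetical helper and the field values invented.

```r
library(tibble)

# Hypothetical helper: replace NULL entries with NA so every field
# becomes a column, then build a one-row tibble.
fields_to_row <- function(fields) {
  is_missing <- vapply(fields, is.null, logical(1))
  fields[is_missing] <- NA_character_
  as_tibble(fields)
}

# Invented example: the stock selector matched nothing.
fields <- list(
  title = "Widget",
  price = "\u00a319.99",
  stock = NULL,
  rating = "4.6"
)

row <- fields_to_row(fields)
```

The result always has one column per requested selector, so rows from different runs bind cleanly even when a field goes missing.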
8. JavaScript-rendered tables and SPAs
rvest operates on the HTML string returned by the server — it has no JavaScript engine. Dashboards that populate tables by dispatching fetch() or XMLHttpRequest calls after page load will appear empty when parsed with html_table(). The symptom is html_elements('table') returning a nodeset of length zero on a page that visually shows a full data grid in the browser.
Use mode js_rendering with js_wait_selector set to a CSS selector that only appears once the target data is in the DOM. js_wait_timeout is in milliseconds; 15 000 (15 s) is a reasonable starting point for dashboards that fetch data from a slow API. See scraping JavaScript-rendered pages for a deeper treatment of wait strategies.
resp <- request("https://api.omniscrape.io/v1/scrape") |>
  req_method("POST") |>
  req_headers("X-API-Key" = api_key) |>
  req_body_json(list(
    url = "https://spa-dashboard.com/metrics",
    mode = "js_rendering",
    output_format = "html",
    js_wait_selector = "table.data-grid tbody tr",
    js_wait_timeout = 15000
  )) |>
  req_timeout(120) |>
  req_perform()
body <- resp_body_json(resp)
if (!isTRUE(body$success)) stop("JS render failed")
# body$data$content holds the post-JS HTML
page <- read_html(body$data$content)
tbl <- page |> html_element("table.data-grid") |> html_table()
cat("Rows fetched:", nrow(tbl), "\n")
print(head(tbl))
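The empty-table symptom is easy to reproduce offline, which also gives you a cheap check to run before deciding on js_rendering. Both HTML strings below are invented stand-ins: one for what a server sends before JavaScript runs, one for the DOM after rendering. `count_rows()` is a hypothetical helper.

```r
library(rvest)

# Invented stand-in for the server response of a SPA: an empty mount point.
shell_html <- paste0(
  "<html><body><div id='app'></div>",
  "<script src='/bundle.js'></script></body></html>"
)

# Invented stand-in for the same page after JavaScript has filled the grid.
rendered_html <- paste0(
  "<html><body><div id='app'><table class='data-grid'><tbody>",
  "<tr><td>cpu</td><td>0.42</td></tr>",
  "</tbody></table></div></body></html>"
)

# Hypothetical helper: count the rows the wait selector would match.
count_rows <- function(html) {
  read_html(html) |>
    html_elements("table.data-grid tbody tr") |>
    length()
}

rows_before <- count_rows(shell_html)
rows_after <- count_rows(rendered_html)
```

Using the same selector for the row count and for js_wait_selector keeps the check and the wait condition aligned.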
9. Reproducible research habits for scraped data
Academic and policy workflows require audit trails that survive journal peer review, lab handovers, and re-analysis years later. Scraped data is inherently volatile — sites change structure, disappear, or add access controls. Build defensively from the start rather than retrofitting reproducibility after a deadline.
Key practices to adopt from the first script:
- Store OMNISCRAPE_KEY in ~/.Renviron, never in .R scripts or knitted Rmd/Quarto output — use usethis::edit_r_environ() to open the file safely
- Save raw HTML to disk immediately after fetching, with an ISO-8601 timestamp and the target URL's slug in the filename (e.g. data/raw/portal-gov-q4_2024-11-01.html)
- Log body$metadata$method_used, body$metadata$solver_used, and body$billing$charged in a structured scrape_log.csv alongside the data — useful for cost tracking and debugging
- Pin package versions with renv::snapshot() and commit renv.lock to version control so collaborators and CI reproduce the identical environment
- Use Rscript + cron (Linux/macOS) or Task Scheduler (Windows) for production scheduled scrapes rather than manual click-run — document the schedule in a README
- Write a data provenance section in your Rmd/Quarto document that records the scrape date range, source URLs, and OmniScrape mode used
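The structured scrape log from the list above might be kept with a small append-only writer. This is a base-R sketch: `log_scrape()` is a hypothetical helper, the column names mirror the metadata fields named in this guide, and the values in the demo calls are made up.

```r
# Hypothetical helper: append one row per scrape to a CSV log,
# writing the header only when the file is first created.
log_scrape <- function(url, method_used, solver_used, charged,
                       log_path = "scrape_log.csv") {
  entry <- data.frame(
    scraped_at  = format(Sys.time(), "%Y-%m-%dT%H:%M:%S%z"),
    url         = url,
    method_used = method_used,
    solver_used = solver_used,
    charged     = charged
  )
  is_new <- !file.exists(log_path)
  write.table(entry, log_path, sep = ",", row.names = FALSE,
              col.names = is_new, append = !is_new, qmethod = "double")
  invisible(entry)
}

# Demo with made-up values, written to a temporary file.
demo_log <- tempfile(fileext = ".csv")
log_scrape("https://site-a.com/data", "http", FALSE, 0.001, demo_log)
log_scrape("https://site-b.com/data", "js_rendering", TRUE, 0.004, demo_log)

log_df <- read.csv(demo_log)
```

In a real pipeline the arguments would come from body$metadata and body$billing right after each successful call.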
10. Handle API failures and HTTP errors in R
Production scrapers need explicit error handling at two layers: the HTTP response status from httr2, and the body$success flag from OmniScrape. Use purrr::safely() or purrr::possibly() to wrap scrape calls in batch jobs so one failed URL does not abort the entire run.
- HTTP 401 — API key missing or malformed; fix .Renviron, call usethis::edit_r_environ(), stop the pipeline immediately
- HTTP 402 — account balance exhausted; pause the scheduled job, notify the account holder, do not retry automatically
- HTTP 429 — rate limit exceeded; implement exponential backoff with Sys.sleep(2^attempt) inside a tryCatch loop, cap at 3–4 attempts
- HTTP 502 / 503 — transient upstream error; retry up to 3 times with a short delay, log each attempt
- body$success == FALSE — the scrape completed but the target was unreachable or returned an unexpected response; log the URL and body$error, continue the batch with purrr::safely()
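- Garbled non-ASCII text after html_text2(): write_csv() always writes UTF-8 and has no locale argument, so the mismatch is usually on the reading side. Use write_excel_csv() when the file will be opened in Excel (it adds a UTF-8 byte-order mark), and pass encoding to read_html() when the source page uses a legacy encoding. It is rarely an rvest bug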
library(purrr)
safe_scrape <- safely(\(target_url) {
  resp <- request("https://api.omniscrape.io/v1/scrape") |>
    req_method("POST") |>
    req_headers("X-API-Key" = api_key) |>
    req_body_json(list(
      url = target_url,
      mode = "auto",
      output_format = "html",
      enable_solver = TRUE
    )) |>
    req_timeout(120) |>
    req_error(is_error = \(r) FALSE) |>
    req_perform()
  body <- resp_body_json(resp)
  if (!isTRUE(body$success)) stop("API error: ", body$error$message)
  body$data$content
})
urls <- c("https://site-a.com/data", "https://site-b.com/data")
results <- map(urls, safe_scrape)
successes <- keep(results, \(r) is.null(r$error))
failures <- keep(results, \(r) !is.null(r$error))
cat("Succeeded:", length(successes), "| Failed:", length(failures), "\n")
walk(failures, \(r) message("Error: ", r$error$message))
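The exponential backoff described for HTTP 429 can be sketched as a small wrapper around tryCatch. `with_backoff()` is a hypothetical helper, and `flaky()` is a stand-in that fails twice before succeeding. The `sleep` argument exists so the demo can record the delays instead of actually waiting; in real use, leave it at the default Sys.sleep.

```r
# Hypothetical helper: call f(), retrying on error with delays of
# base_delay^attempt seconds, and give up after max_attempts tries.
with_backoff <- function(f, max_attempts = 4, base_delay = 2,
                         sleep = Sys.sleep) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt == max_attempts) stop(result)  # re-raise the last error
    sleep(base_delay^attempt)
  }
}

# Stand-in for a rate-limited call: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("HTTP 429")
  "ok"
}

# Record the delays instead of sleeping so the demo runs instantly.
waited <- numeric(0)
result <- with_backoff(flaky, sleep = function(s) waited <<- c(waited, s))
```

For requests sent with httr2, req_retry() offers built-in retries with backoff and may be the simpler choice; the wrapper above is useful when the retried step is more than a single request.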
Frequently asked questions
Should I use httr2 or the older httr package?
Use httr2 for all new projects. httr is in maintenance mode — it receives security fixes but no new features. httr2 has a cleaner pipe-based API, built-in retry and throttle helpers (req_retry(), req_throttle()), proper OAuth2 support, and better error handling via req_error(). Migration from httr is straightforward: request() replaces GET()/POST(), resp_body_string() replaces content(..., as='text').
When should I use rvest versus xml2 directly?
rvest is a wrapper around xml2 optimised for HTML scraping, and it is the right default. html_elements() accepts XPath through its xpath argument as well as CSS selectors, so needing XPath is not by itself a reason to switch. Use xml2 directly when you are processing strict XML (RSS feeds, Atom, SOAP responses) or when you need to modify and re-serialise a document. For HTML scraping, rvest's html_elements(), html_text2(), and html_table() cover the vast majority of cases.
Can I call OmniScrape from a live Shiny application?
Technically yes, but it is rarely the right architecture. OmniScrape requests take 2–30 seconds depending on the mode and solver; that latency will block a reactive and frustrate users. The better pattern is to pre-scrape on a schedule (cron + Rscript), write results to a parquet file or database, and have Shiny read from that cache. Reserve live OmniScrape calls for on-demand refresh buttons where the user explicitly accepts a wait.
How do I scrape multiple pages in parallel without getting blocked?
Use future + furrr for parallelism and httr2's req_throttle() to enforce a minimum delay between requests to the same host. For sites behind bot protection, parallel requests through OmniScrape are safe because the API manages IP rotation and session handling — you can fan out with furrr::future_map() without worrying about triggering rate limits on your own IP. Keep concurrency modest (4–8 workers) to avoid exhausting your API balance faster than expected.
How do I handle non-ASCII characters and encoding issues?
rvest reads encoding from the HTML meta charset declaration and handles UTF-8 correctly in most cases. Problems usually appear after writing: write_csv() always writes UTF-8 and has no locale argument, so garbled text typically means the program reading the file assumed a different encoding. Excel on Windows is the common culprit; write_excel_csv() adds a UTF-8 byte-order mark so Excel detects the encoding. If source HTML is in a legacy encoding (ISO-8859-1, Windows-1252), read_html(html, encoding = 'latin1') forces the correct interpretation.
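A small offline sketch of the legacy-encoding case: the bytes below simulate a page served as ISO-8859-1 with no charset declaration, and the encoding argument tells read_html() how to decode them. The HTML fragment is invented for illustration.

```r
library(rvest)

# Simulate ISO-8859-1 bytes for a page without a charset declaration.
latin1_bytes <- charToRaw(iconv(
  "<html><body><p>caf\u00e9</p></body></html>",
  from = "UTF-8", to = "ISO-8859-1"
))

# Tell the parser which encoding the raw bytes use.
word <- read_html(latin1_bytes, encoding = "ISO-8859-1") |>
  html_element("p") |>
  html_text2()

# readr writes the result as UTF-8 regardless of the platform locale.
out_path <- tempfile(fileext = ".csv")
readr::write_csv(data.frame(word = word), out_path)
```

Parsing from raw bytes matters here: once a body has already been decoded into an R string with the wrong encoding, re-declaring the encoding at parse time cannot always repair it.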
What is the difference between mode auto and mode js_rendering?
mode auto attempts a fast HTTP fetch first and escalates to a headless browser automatically if the response looks like a bot challenge or if the content is clearly incomplete. It is the right default for most targets. mode js_rendering always uses a headless browser, which costs more credits and takes longer but guarantees JavaScript execution — use it when you know the target is a SPA or dashboard that renders content entirely client-side and auto's escalation heuristic is not triggering reliably.
How do I integrate OmniScrape scraping into an R package or research compendium?
Store the API key in .Renviron and read it with Sys.getenv() inside your package functions; never hardcode it. Document the environment variable in your package README. For a research compendium, add a data-raw/ directory containing the scraping scripts, and use targets (the successor to drake, which is now superseded) to define a pipeline where scraping is one step and analysis is downstream. This makes the data provenance explicit and allows collaborators to reproduce the full pipeline by running targets::tar_make().