1. Cargo.toml dependencies
reqwest with the json feature (plus blocking for the synchronous client in step 2), tokio for async, scraper for DOM queries, serde and serde_json for API responses.
[dependencies]
reqwest = { version = "0.12", features = ["json", "blocking"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.20"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
2. Blocking fetch for CLI tools
A synchronous reqwest::blocking client is fine for one-shot binaries; it requires reqwest's blocking feature. Production services usually move to async Tokio.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the client once; the timeout covers the whole request.
    let client = reqwest::blocking::Client::builder()
        .timeout(std::time::Duration::from_secs(30))
        .build()?;
    let html = client
        .get("https://books.toscrape.com/catalogue/page-1.html")
        .send()?
        .text()?;
    println!("fetched {} bytes", html.len());
    Ok(())
}
3. Parse with scraper
Html::parse_document builds a DOM. Selector::parse compiles a CSS selector, so compile each selector once, outside hot loops, and pass it to select.
use scraper::{Html, Selector};
// `html` is the page body fetched in the previous step.
let document = Html::parse_document(&html);
// Compile selectors once, before the loop.
let card_sel = Selector::parse("article.product_pod").unwrap();
let title_sel = Selector::parse("h3 a").unwrap();
let price_sel = Selector::parse(".price_color").unwrap();
let mut books = Vec::new();
for card in document.select(&card_sel) {
    // The title is read from the link's `title` attribute.
    let title = card.select(&title_sel).next()
        .and_then(|el| el.value().attr("title"))
        .unwrap_or("")
        .to_string();
    let price = card.select(&price_sel).next()
        .map(|el| el.text().collect::<String>().trim().to_string())
        .unwrap_or_default();
    books.push((title, price));
}
println!("found {} books", books.len());
4. Model the API response
serde structs force you to think about optional fields, which pays off when OmniScrape adds metadata. Never unwrap() on production JSON without a fallback.
use serde::Deserialize;
#[derive(Debug, Deserialize)]
struct ScrapeResponse {
    success: bool,
    data: Option<ScrapeData>,
    metadata: Option<Metadata>,
    billing: Option<Billing>,
}
#[derive(Debug, Deserialize)]
struct ScrapeData {
    content: Option<String>,
    css_extracted: Option<serde_json::Value>,
}
#[derive(Debug, Deserialize)]
struct Metadata {
    method_used: String,
}
#[derive(Debug, Deserialize)]
struct Billing {
    charged: f64,
}
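To make the "no unwrap() without a fallback" advice concrete, here is a minimal sketch of reading nested Option fields safely. The structs are plain, trimmed-down mirrors of the ones above (no serde) so the snippet compiles on its own; the helper names content_or_empty and charged_or_zero are made up for illustration.

```rust
// Plain mirrors of the response structs, reduced to the fields used here.
#[derive(Debug, Default)]
struct ScrapeData {
    content: Option<String>,
}

#[derive(Debug, Default)]
struct Billing {
    charged: f64,
}

#[derive(Debug, Default)]
struct ScrapeResponse {
    data: Option<ScrapeData>,
    billing: Option<Billing>,
}

/// Returns the HTML body, or "" when `data` or `content` is absent.
/// (Hypothetical helper, not part of any API.)
fn content_or_empty(resp: &ScrapeResponse) -> &str {
    resp.data
        .as_ref()
        .and_then(|d| d.content.as_deref())
        .unwrap_or("")
}

/// Returns the charged amount, treating a missing `billing` block as zero.
/// (Hypothetical helper, not part of any API.)
fn charged_or_zero(resp: &ScrapeResponse) -> f64 {
    resp.billing.as_ref().map(|b| b.charged).unwrap_or(0.0)
}
```

Whether an empty string or zero is the right fallback depends on your pipeline; returning an Option or an error to the caller is often the better choice when absence should be visible.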
5. Async OmniScrape with reqwest
Tokio plus reqwest scales to many concurrent API calls. Use a Semaphore from tokio::sync to cap in-flight js_rendering jobs.
When a direct fetch fails on a protected site, this POST replaces your GET entirely; see Cloudflare bypass for why.
use scraper::{Html, Selector};
// `ScrapeResponse` is the struct from the previous step.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let api_key = std::env::var("OMNISCRAPE_KEY")?;
    let res = client
        .post("https://api.omniscrape.io/v1/scrape")
        .header("X-API-Key", api_key)
        .json(&serde_json::json!({
            "url": "https://protected-shop.com/item/99",
            "mode": "auto",
            "output_format": "html",
        }))
        .timeout(std::time::Duration::from_secs(120))
        .send()
        .await?;
    let body: ScrapeResponse = res.json().await?;
    if !body.success {
        // Convert the message into the boxed error type that main returns.
        return Err("scrape failed".into());
    }
    // Fall back to an empty document instead of unwrapping missing fields.
    let html = body.data.and_then(|d| d.content).unwrap_or_default();
    let document = Html::parse_document(&html);
    let price_sel = Selector::parse(".product-price").unwrap();
    let price = document.select(&price_sel).next()
        .map(|el| el.text().collect::<String>());
    println!("price: {:?}", price);
    if let Some(m) = body.metadata {
        println!("via {}", m.method_used);
    }
    Ok(())
}
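6. Concurrent fetches with Tokio
futures::future::join_all or a stream with buffer_unordered can process URL lists; the version here spawns one task per URL and caps concurrency with a Semaphore. Handle each per-URL Err without aborting the batch.
use std::sync::Arc;
use tokio::sync::Semaphore;
// `client` and `api_key` come from the previous step; `scrape_one` is your own
// async helper wrapping the POST shown there (not defined in this guide).
let sem = Arc::new(Semaphore::new(5)); // at most 5 requests in flight
let urls = vec!["https://example.com/a", "https://example.com/b"];
let handles: Vec<_> = urls.into_iter().map(|url| {
    let client = client.clone();
    let sem = sem.clone();
    let key = api_key.clone();
    tokio::spawn(async move {
        // The permit is released when `_permit` drops at the end of the task.
        let _permit = sem.acquire().await.unwrap();
        scrape_one(&client, &key, url).await
    })
}).collect();
for h in handles {
    match h.await {
        Ok(Ok(data)) => println!("ok: {:?}", data),
        Ok(Err(e)) => eprintln!("err: {}", e),
        // A panicked or cancelled task should not abort the rest of the batch.
        Err(join_err) => eprintln!("task failed: {}", join_err),
    }
}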
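Once the per-URL results are in hand, it helps to keep the failures around rather than just printing them. The sketch below is synchronous and stdlib-only, and split_results is a hypothetical helper name; it separates successes from failures so a single bad URL never discards the rest. By contrast, collecting into Result<Vec<_>, _> would stop at the first Err.

```rust
/// Splits per-URL outcomes into successes and failures.
/// Every input entry ends up in exactly one of the two lists, in input order.
fn split_results<T, E>(
    results: Vec<(String, Result<T, E>)>,
) -> (Vec<(String, T)>, Vec<(String, E)>) {
    let mut ok = Vec::new();
    let mut failed = Vec::new();
    for (url, outcome) in results {
        match outcome {
            Ok(value) => ok.push((url, value)),
            Err(err) => failed.push((url, err)),
        }
    }
    (ok, failed)
}
```

The failed list is a natural input for a retry pass or a dead-letter store.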
7. Skip scraper with css_extractor
When you only need a few fields, deserialize css_extracted into a struct and skip DOM walking entirely.
.json(&serde_json::json!({
    "url": target,
    "mode": "auto",
    "output_format": "css_extractor",
    "css_selectors": {
        "title": "h1",
        "price": ".price"
    }
}))
8. js_rendering for client-rendered HTML
scraper parses static trees only. React SPAs need js_rendering with js_wait_selector; see scraping JavaScript-rendered pages for when to use it.
9. Result types in production
Map API failures to actionable variants instead of panicking:
- 401 — config error, return early
- 402 — budget exhausted, stop scheduler
- 429 — sleep with jitter, retry
- 502 — retry with cap
- success:false — log URL to dead-letter store
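One way to encode the list above is an enum plus a small classifier, sketched here with the standard library only. The names Action, classify, and backoff_delay are illustrative, not part of any crate. Two assumptions: success:false arrives with a 2xx status, and the jitter value (0.0 to 1.0) comes from whatever random source you already use.

```rust
use std::time::Duration;

/// What the caller should do next for one API response (illustrative names).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Proceed,         // 2xx with success: true
    ConfigError,     // 401: bad or missing key, return early
    StopScheduler,   // 402: budget exhausted
    RetryWithJitter, // 429: rate limited
    RetryCapped,     // 502: upstream failure, retry a bounded number of times
    DeadLetter,      // 2xx with success: false, log the URL
    Unexpected(u16), // anything else, left for the caller to decide
}

/// Maps an HTTP status and the body's `success` flag to an action.
fn classify(status: u16, success: bool) -> Action {
    match status {
        401 => Action::ConfigError,
        402 => Action::StopScheduler,
        429 => Action::RetryWithJitter,
        502 => Action::RetryCapped,
        200..=299 if success => Action::Proceed,
        200..=299 => Action::DeadLetter,
        other => Action::Unexpected(other),
    }
}

/// Exponential backoff capped at `cap`, scaled by `jitter` in [0.0, 1.0].
/// Attempt 0 waits up to `base`, attempt 1 up to 2 * base, and so on.
/// `jitter` must be a finite number; values outside the range are clamped.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration, jitter: f64) -> Duration {
    let exponential = base.saturating_mul(2u32.saturating_pow(attempt));
    let capped = exponential.min(cap);
    capped.mul_f64(jitter.clamp(0.0, 1.0))
}
```

For example, backoff_delay(3, Duration::from_secs(1), Duration::from_secs(30), 0.5) yields a 4-second wait: 8 seconds of exponential delay, halved by the jitter factor.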
Frequently asked questions
reqwest blocking or async?
Blocking for CLIs and quick tools. Async Tokio for services fanning out hundreds of OmniScrape calls.
scraper or select.rs?
scraper is the common choice with familiar CSS selectors. select.rs is lighter if you only need a few queries.
Should I build anti-bot bypass in Rust?
Only if bypass engineering is your product. Otherwise OmniScrape keeps your Rust code focused on parsing and storage.
How do I avoid compiling selectors every iteration?
Call Selector::parse once at startup and clone the result into tasks, or cache it in a lazy_static or OnceLock static.
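A minimal sketch of the OnceLock variant follows. To stay dependency-free it uses a stand-in CompiledSelector type and a compile function with a call counter; both are hypothetical. In real code the closure body would be the Selector::parse(...) call.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a compiled selector (hypothetical, so the sketch needs no crates).
#[derive(Debug)]
struct CompiledSelector {
    css: String,
}

/// Counts how often the "expensive" compile step actually runs.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the compile step; a real version would parse the CSS.
fn compile(css: &str) -> CompiledSelector {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    CompiledSelector { css: css.to_string() }
}

/// Returns the shared selector, compiling it on the first call only.
fn price_selector() -> &'static CompiledSelector {
    static PRICE: OnceLock<CompiledSelector> = OnceLock::new();
    PRICE.get_or_init(|| compile(".price_color"))
}
```

Every call to price_selector() after the first returns the same reference, so it is safe to call inside a hot loop.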
hyper directly instead of reqwest?
hyper for maximal control. reqwest is ergonomic for JSON APIs like OmniScrape.