Web Scraping with Rust

Rust shows up in scrape orchestration when teams want memory safety and predictable tail latency — security scanners, fintech monitors, ad verification pipelines. reqwest handles HTTPS; the scraper crate parses HTML with CSS selectors on top of html5ever.

What Rust will not spare you is Cloudflare. Building TLS fingerprint mimicry in native code is a full-time job. The simpler route is to POST to the OmniScrape API, deserialize the JSON with serde, and feed the returned HTML to scraper. The Python scraping guide uses the same endpoint if you are prototyping in two languages.

1. Cargo.toml dependencies

reqwest with the json and blocking features (blocking is required for reqwest::blocking in the next section), tokio for async, scraper for DOM queries, serde and serde_json for API responses.

Cargo.toml

toml

[dependencies]
reqwest = { version = "0.12", features = ["json", "blocking"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.20"
serde = { version = "1", features = ["derive"] }
serde_json = "1"

2. Blocking fetch for CLI tools

A synchronous reqwest::blocking client is fine for one-shot binaries. Production services usually move to async Tokio.

fetch.rs

rust

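use std::time::Duration;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // reqwest::blocking is only available with the "blocking" feature enabled.
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;

    // send() and text() both return Result, so `?` propagates network errors.
    let html = client
        .get("https://books.toscrape.com/catalogue/page-1.html")
        .send()?
        .text()?;

    println!("fetched {} bytes", html.len());
    Ok(())
}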

3. Parse with scraper

Html::parse_document builds a DOM. Selector::parse compiles a CSS selector, and select runs a compiled Selector against the tree. Compile selectors once, outside hot loops.

parse.rs

rust

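use scraper::{Html, Selector};

/// Extracts (title, price) pairs from a books.toscrape.com listing page.
fn parse_books(html: &str) -> Vec<(String, String)> {
    let document = Html::parse_document(html);

    // Compile each selector once, before the loop. The literals are valid CSS,
    // so these unwrap() calls cannot fail at runtime.
    let card_sel = Selector::parse("article.product_pod").unwrap();
    let title_sel = Selector::parse("h3 a").unwrap();
    let price_sel = Selector::parse(".price_color").unwrap();

    let mut books = Vec::new();
    for card in document.select(&card_sel) {
        // The full title lives in the link's title attribute; the link text is truncated.
        let title = card.select(&title_sel).next()
            .and_then(|el| el.value().attr("title"))
            .unwrap_or("")
            .to_string();
        // Missing price elements fall back to an empty string instead of panicking.
        let price = card.select(&price_sel).next()
            .map(|el| el.text().collect::<String>().trim().to_string())
            .unwrap_or_default();
        books.push((title, price));
    }

    println!("found {} books", books.len());
    books
}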

4. Model the API response

serde structs force you to think about optional fields — good when OmniScrape adds metadata. Never unwrap() on production JSON without a fallback.

types.rs

rust

use serde::Deserialize;

// Top-level envelope. Everything except `success` is optional so that a
// missing field deserializes to None instead of failing the whole response.
#[derive(Debug, Deserialize)]
struct ScrapeResponse {
    success: bool,
    data: Option<ScrapeData>,
    metadata: Option<Metadata>,
    billing: Option<Billing>,
}

#[derive(Debug, Deserialize)]
struct ScrapeData {
    content: Option<String>,
    // Left as untyped JSON because its shape depends on the selectors requested.
    css_extracted: Option<serde_json::Value>,
}

#[derive(Debug, Deserialize)]
struct Metadata {
    method_used: String,
}

#[derive(Debug, Deserialize)]
struct Billing {
    charged: f64,
}
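
To make the "no bare unwrap()" advice concrete, here is a minimal sketch of pulling the HTML out of a response with explicit fallbacks. The serde derives are dropped and the structs trimmed so it compiles with the standard library alone; `ContentError` and `html_of` are hypothetical names for illustration, not part of OmniScrape or serde.

```rust
// Trimmed copies of the response types above (serde derives omitted
// so this sketch has no external dependencies).
struct ScrapeData {
    content: Option<String>,
}

struct ScrapeResponse {
    success: bool,
    data: Option<ScrapeData>,
}

/// Hypothetical error type: why no usable HTML came back.
#[derive(Debug, PartialEq)]
enum ContentError {
    /// The API reported success: false.
    Failed,
    /// The response succeeded but carried no (or empty) content.
    Missing,
}

/// Returns the HTML body, or a typed error instead of panicking.
fn html_of(resp: &ScrapeResponse) -> Result<&str, ContentError> {
    if !resp.success {
        return Err(ContentError::Failed);
    }
    resp.data
        .as_ref()
        .and_then(|d| d.content.as_deref())
        .filter(|html| !html.is_empty())
        .ok_or(ContentError::Missing)
}

fn main() {
    let resp = ScrapeResponse {
        success: true,
        data: Some(ScrapeData { content: Some("<p>hi</p>".to_string()) }),
    };
    match html_of(&resp) {
        Ok(html) => println!("got {} bytes", html.len()),
        Err(e) => eprintln!("no content: {:?}", e),
    }
}
```

Returning a Result here lets the caller decide whether a missing body is a retry, a skip, or a hard failure, rather than silently parsing an empty document.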

5. Async OmniScrape with reqwest

Tokio + reqwest scales concurrent API calls. Use a Semaphore from tokio::sync to cap in-flight js_rendering jobs.

When direct fetch fails on protected sites, this replaces your GET entirely — see Cloudflare bypass for why.

omniscrape.rs

rust

use scraper::{Html, Selector};

// ScrapeResponse is the struct from types.rs above.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();
    let api_key = std::env::var("OMNISCRAPE_KEY")?;

    let res = client
        .post("https://api.omniscrape.io/v1/scrape")
        .header("X-API-Key", api_key)
        .json(&serde_json::json!({
            "url": "https://protected-shop.com/item/99",
            "mode": "auto",
            "output_format": "html",
        }))
        // Per-request timeout; rendering a protected page can be slow.
        .timeout(std::time::Duration::from_secs(120))
        .send()
        .await?;

    let body: ScrapeResponse = res.json().await?;
    if !body.success {
        // A &str converts into Box<dyn Error>, so no extra error crate is needed.
        return Err("scrape failed".into());
    }

    // Missing data or content falls back to an empty document.
    let html = body.data.and_then(|d| d.content).unwrap_or_default();
    let document = Html::parse_document(&html);
    let price_sel = Selector::parse(".product-price").unwrap();
    let price = document.select(&price_sel).next()
        .map(|el| el.text().collect::<String>());

    println!("price: {:?}", price);
    if let Some(m) = body.metadata {
        println!("via {}", m.method_used);
    }
    Ok(())
}

6. Concurrent fetches with Tokio

futures::future::join_all or a stream with buffer_unordered can drive a URL list; the version here spawns one Tokio task per URL and bounds concurrency with a Semaphore. Handle each per-URL Err without aborting the batch.

pool.rs

rust

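use std::sync::Arc;
use tokio::sync::Semaphore;

// Runs inside an async fn. `client`, `api_key`, and `scrape_one` (one API call
// per URL, returning a Result) are defined elsewhere and not shown here.
let sem = Arc::new(Semaphore::new(5)); // at most 5 requests in flight
let urls = vec!["https://example.com/a", "https://example.com/b"];

let handles: Vec<_> = urls.into_iter().map(|url| {
    let client = client.clone();
    let sem = sem.clone();
    let key = api_key.clone();
    tokio::spawn(async move {
        // The permit is released when `_permit` drops at the end of the task.
        let _permit = sem.acquire().await.expect("semaphore closed");
        scrape_one(&client, &key, url).await
    })
}).collect();

for h in handles {
    match h.await {
        Ok(Ok(data)) => println!("ok: {:?}", data),
        Ok(Err(e)) => eprintln!("err: {}", e),
        // A panicked or cancelled task yields a JoinError; log it and keep going
        // rather than aborting the rest of the batch with `?`.
        Err(join_err) => eprintln!("task failed: {}", join_err),
    }
}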

7. Skip scraper with css_extractor

When you only need a few fields, deserialize css_extracted into a struct and skip DOM walking entirely.

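structured.rs

rust

// Request-builder fragment: asks the API to run the selectors server-side.
// `target` is the URL to scrape, supplied by the caller.
.json(&serde_json::json!({
    "url": target,
    "mode": "auto",
    "output_format": "css_extractor",
    "css_selectors": {
        "title": "h1",
        "price": ".price"
    }
}))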

8. js_rendering for client-rendered HTML

scraper parses static trees only. React SPAs need js_rendering with js_wait_selector — scraping JavaScript-rendered pages covers when to use it.

9. Result types in production

Map API failures to actionable variants instead of panicking:

401 — config error, return early
402 — budget exhausted, stop scheduler
429 — sleep with jitter, retry
502 — retry with cap
success:false — log URL to dead-letter store
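
The mapping above can be sketched as a small classifier. This is one possible shape, not an OmniScrape SDK: the `Action` enum, `classify`, `backoff`, the retry cap of 4, and the 500 ms / 30 s delay bounds are all assumptions chosen for illustration, as is sending a 502 to the dead-letter store once its cap is reached. The jitter value is passed in (for example from the rand crate) so the function stays deterministic and easy to test.

```rust
use std::time::Duration;

/// What the caller should do next. Variant names are illustrative.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Proceed,
    FixConfig,            // 401: bad or missing API key
    StopScheduler,        // 402: budget exhausted
    RetryAfter(Duration), // 429, or 502 while under the cap
    DeadLetter,           // success:false, or 502 past the cap
}

// Assumed tuning values, not taken from any API documentation.
const MAX_RETRIES: u32 = 4;
const BASE_DELAY_MS: u64 = 500;
const MAX_DELAY_MS: u64 = 30_000;

/// Exponential backoff, capped, plus up to 50% extra delay as jitter.
/// `jitter` is a caller-supplied random value in [0, 1].
fn backoff(attempt: u32, jitter: f64) -> Duration {
    let exp = BASE_DELAY_MS.saturating_mul(1u64 << attempt.min(16));
    let capped = exp.min(MAX_DELAY_MS);
    let extra = (capped as f64 * jitter.clamp(0.0, 1.0) * 0.5) as u64;
    Duration::from_millis(capped + extra)
}

/// Maps an HTTP status plus the body's `success` flag to an action.
fn classify(status: u16, success: bool, attempt: u32, jitter: f64) -> Action {
    match status {
        401 => Action::FixConfig,
        402 => Action::StopScheduler,
        429 => Action::RetryAfter(backoff(attempt, jitter)),
        502 if attempt < MAX_RETRIES => Action::RetryAfter(backoff(attempt, jitter)),
        502 => Action::DeadLetter,
        _ if !success => Action::DeadLetter,
        _ => Action::Proceed,
    }
}

fn main() {
    // Second attempt at a rate-limited URL with a jitter draw of 0.25.
    println!("{:?}", classify(429, false, 1, 0.25));
    // A 502 that has already used up its retries.
    println!("{:?}", classify(502, false, MAX_RETRIES, 0.0));
}
```

Keeping the decision in a pure function separates policy from I/O: the scheduler loop only has to match on `Action` and sleep, stop, or write to the dead-letter store.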

Frequently asked questions

reqwest blocking or async?

Blocking for CLIs and quick tools. Async Tokio for services fanning out hundreds of OmniScrape calls.

scraper or select.rs?

scraper is the common choice with familiar CSS selectors. select.rs is lighter if you only need a few queries.

Should I build anti-bot bypass in Rust?

Only if bypass engineering is your product. Otherwise OmniScrape keeps your Rust code focused on parsing and storage.

How do I avoid compiling selectors every iteration?

Call Selector::parse once at startup and clone the resulting Selector into tasks, or store it in a static with lazy_static or std::sync::OnceLock.
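
As a rough illustration of the OnceLock route, the sketch below initializes a value on first use and hands out the same reference afterwards. To keep it runnable with the standard library alone, `Compiled` and `compile` are stand-ins for scraper's Selector and Selector::parse; in real code the closure would call Selector::parse instead. The call counter exists only to show that compilation happens once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a compiled selector (hypothetical; replaces scraper::Selector
/// so this sketch needs no external crates).
#[derive(Debug)]
struct Compiled(String);

/// Counts how often the "expensive" compile step actually runs.
static COMPILE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for Selector::parse.
fn compile(css: &str) -> Compiled {
    COMPILE_CALLS.fetch_add(1, Ordering::SeqCst);
    Compiled(css.trim().to_string())
}

/// Returns the shared selector, compiling it on the first call only.
fn price_selector() -> &'static Compiled {
    static SEL: OnceLock<Compiled> = OnceLock::new();
    SEL.get_or_init(|| compile(".price_color"))
}

fn main() {
    // A hot loop: every iteration reuses the same compiled value.
    for _ in 0..1000 {
        let _sel = price_selector();
    }
    println!("compiled {} time(s)", COMPILE_CALLS.load(Ordering::SeqCst));
}
```

Because the accessor returns a `&'static` reference, spawned tasks can call it directly without cloning or passing the selector around.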

hyper directly instead of reqwest?

hyper for maximal control. reqwest is ergonomic for JSON APIs like OmniScrape.
