Web Scraping with Ruby: Faraday, Nokogiri, Sidekiq & OmniScrape

Ruby is a strong choice for production scraping pipelines. Nokogiri turns raw HTML into a queryable document tree — CSS selectors for most tasks, XPath when you need to traverse siblings or text nodes. Faraday gives you a composable HTTP client with middleware for retries, JSON encoding, and header management, without the TLS footguns that haunt old open-uri tutorials.

Rails teams push scrape work into Sidekiq so web requests stay fast and failures are isolated. When Faraday gets back a Cloudflare challenge page or a CAPTCHA wall instead of real HTML, hand the URL to the OmniScrape API and keep your Nokogiri selectors unchanged — the API returns clean HTML in `data.content`. Our Cloudflare bypass guide covers why challenge pages defeat direct HTTP clients. For the same patterns in another language, see web scraping with Python.

1. Gems you need

The core dependency pair is `faraday` for HTTP and `nokogiri` for HTML parsing. Add `faraday-retry` for middleware-based exponential backoff on transient errors like 502s — this keeps retry logic out of your worker code. `sidekiq` handles background job queuing and concurrency for production-scale crawls. Pin versions in your Gemfile and commit the lockfile so CI and production use identical native extensions, which matters for Nokogiri's C bindings.

terminal

bash

gem install faraday nokogiri faraday-retry

# Gemfile:
# gem "faraday",       "~> 2.9"
# gem "faraday-retry", "~> 2.2"
# gem "nokogiri",      "~> 1.16"
# gem "sidekiq",       "~> 7.3"   # background jobs

2. Fetch HTML with Faraday

Build a reusable connection object with explicit timeout values. Ruby's underlying `net/http` defaults are generous enough that a single hung connection can block a Sidekiq thread for minutes. Set `open_timeout` (TCP connect) separately from `timeout` (read). The `faraday-retry` middleware retries the statuses listed in `retry_statuses` (429, 502, and 503 here) with exponential backoff before your code ever sees them.

Always check `response.success?` before passing the body to Nokogiri — a 403 response body is often a bot-detection page, not real content, and parsing it silently produces garbage records.

fetch.rb

ruby

require "faraday"
require "faraday/retry"

conn = Faraday.new do |f|
  f.request :retry, {
    max: 3,
    interval: 1.5,
    interval_randomness: 0.5,
    backoff_factor: 2,
    retry_statuses: [429, 502, 503],
  }
  f.options.timeout      = 30   # read timeout
  f.options.open_timeout = 10   # TCP connect timeout
  f.adapter Faraday.default_adapter
end

response = conn.get("https://books.toscrape.com/catalogue/page-1.html")
raise "HTTP #{response.status}" unless response.success?

html = response.body
puts "Fetched #{html.bytesize} bytes"
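
To make the retry settings above concrete, the following sketch prints the approximate wait before each retry. It assumes the delay is `interval * backoff_factor**attempt` plus random jitter of up to `interval_randomness * interval`, which is how the `faraday-retry` README describes these options. `retry_delays` is a hypothetical helper written for this illustration, not part of the gem.

```ruby
# Hypothetical helper: approximates the delays faraday-retry would wait,
# assuming delay = interval * backoff_factor**attempt (+ optional jitter).
def retry_delays(max:, interval:, backoff_factor:, interval_randomness: 0.0)
  (0...max).map do |attempt|
    base   = interval * (backoff_factor**attempt)
    jitter = rand * interval_randomness * interval
    (base + jitter).round(2)
  end
end

# Without jitter the schedule is deterministic.
base_schedule = retry_delays(max: 3, interval: 1.5, backoff_factor: 2)
puts "base delays: #{base_schedule.inspect}"

# With the settings from fetch.rb, each delay gains up to 0.75s of jitter.
jittered = retry_delays(max: 3, interval: 1.5, backoff_factor: 2, interval_randomness: 0.5)
puts "with jitter: #{jittered.inspect}"
```

Under these assumptions the base delays are 1.5, 3.0, and 6.0 seconds, so a URL that keeps failing waits roughly 10.5 seconds in total (plus jitter and the request time itself) before the final response reaches your code.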

3. Parse HTML with Nokogiri

`Nokogiri::HTML.parse` builds a document from a raw HTML string and tolerates malformed markup. `css()` returns a `NodeSet` you can iterate; `at_css()` returns the first match or `nil`, so always guard against `nil` when a field may be absent. Call `.text.strip` to trim leading and trailing whitespace. Attribute access uses the familiar hash syntax: `node['href']`.

For deeply nested or positional selections, XPath is more expressive than CSS. Use `xpath()` or `at_xpath()` when you need to select a node based on its text content or traverse to a sibling element.

parse.rb

ruby

require "nokogiri"

doc = Nokogiri::HTML.parse(html)
books = []

doc.css("article.product_pod").each do |card|
  books << {
    title:    card.at_css("h3 a")&.[]("title")&.strip,
    price:    card.at_css(".price_color")&.text&.strip,
    in_stock: card.at_css(".instock")&.text&.include?("In stock") || false,
    url:      card.at_css("h3 a")&.[]("href"),
  }
end

puts "Found #{books.size} books"
pp books.first(3)

4. Sidekiq for production-scale scraping

Never run scrape requests synchronously inside a Rails controller or ActiveRecord callback. Enqueue one Sidekiq job per URL or URL batch; workers call OmniScrape, parse the response, and write to the database. This keeps your web process response times fast and lets you tune scraping concurrency independently of your web worker count.

Set a conservative `retry` limit on the scraping queue. Sidekiq's default of 25 retries is spread over roughly three weeks, so a bad URL can keep consuming credits for days. Use `sidekiq_options retry: 3` and a `sidekiq_retries_exhausted` block to dead-letter or alert on permanent failures. Separate the scraping queue from your default queue so a spike in scrape jobs does not delay user-facing background work.

scrape_product_worker.rb

ruby

# app/workers/scrape_product_worker.rb
class ScrapeProductWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3, queue: :scraping, dead: true

  sidekiq_retries_exhausted do |msg, ex|
    Bugsnag.notify(ex, { url: msg["args"].first })
    FailedScrapeUrl.create!(url: msg["args"].first, error: ex.message)
  end

  def perform(url)
    html = OmniScrape.fetch_html(url)
    doc  = Nokogiri::HTML.parse(html)
    Product.upsert_from_doc!(doc, url)
  end
end

# Enqueue a batch:
urls = ["https://shop.com/p/8821", "https://shop.com/p/9034"]
urls.each { |u| ScrapeProductWorker.perform_async(u) }
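
As a rough check on why the retry limit matters, this sketch adds up Sidekiq's documented default retry delay, `(count ** 4) + 15 + (rand(10) * (count + 1))` seconds, with the random term dropped to give a lower bound. `minimum_retry_window` is a hypothetical helper for the arithmetic only.

```ruby
# Sidekiq's documented default delay before retry number `count` (0-based):
#   (count ** 4) + 15 + (rand(10) * (count + 1)) seconds
# This helper drops the random term, so the result is a lower bound.
def minimum_retry_window(retries)
  (0...retries).sum { |count| (count**4) + 15 }
end

default_seconds = minimum_retry_window(25)
capped_seconds  = minimum_retry_window(3)

puts "retry: 25 -> at least #{(default_seconds / 86_400.0).round(1)} days"
puts "retry: 3  -> at least #{capped_seconds} seconds"
```

With the default of 25 retries the job stays alive for at least 1,763,395 seconds (about 20.4 days); with `retry: 3` it is exhausted after roughly a minute, so the `sidekiq_retries_exhausted` block fires while the failure is still fresh.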

5. OmniScrape Faraday client module

Centralize all OmniScrape API calls in a small module so workers stay thin. Use `faraday`'s `:json` request middleware to serialize the body and `:json` response middleware to parse the reply — no manual `JSON.parse`. The API key comes from an environment variable; never hard-code it or commit it to source control.

Log `metadata.method_used` and `billing.charged` on every call. This gives you an audit trail for cost attribution per job class and lets you spot when a target site starts requiring JavaScript rendering, which costs more than a fast HTTP fetch.

lib/omniscrape.rb

ruby

# lib/omniscrape.rb
module OmniScrape
  API_BASE = "https://api.omniscrape.io"

  def self.connection
    @connection ||= Faraday.new(url: API_BASE) do |f|
      f.request  :json
      f.response :json
      f.options.timeout      = 120
      f.options.open_timeout = 15
      f.headers["X-API-Key"] = ENV.fetch("OMNISCRAPE_KEY")
    end
  end

  # Returns clean HTML string from data.content
  def self.fetch_html(url, mode: "auto", **extra)
    payload = { url: url, mode: mode, output_format: "html" }.merge(extra)
    res     = connection.post("/v1/scrape", payload)
    body    = res.body

    raise "OmniScrape error (HTTP #{res.status}): #{body}" unless res.success?
    raise "Scrape failed for #{url}: #{body}" unless body["success"]

    Rails.logger.info(
      "[OmniScrape] #{url} | method=#{body.dig('metadata', 'method_used')} " \
      "solver=#{body.dig('metadata', 'solver_used')} " \
      "cost=$#{body.dig('billing', 'charged')} " \
      "balance=$#{body.dig('billing', 'balance_after')}"
    )

    body.dig("data", "content")
  end
end

6. Structured extraction without Nokogiri

When target fields are stable across pages, use `output_format: "css_extractor"` with a `css_selectors` map. OmniScrape applies the selectors server-side and returns a hash in `data.css_extracted` — you get structured data directly without parsing HTML in Ruby at all. This is ideal for Sidekiq workers that write straight to ActiveRecord: fewer moving parts, no selector logic to maintain in Ruby.

The selector map keys become the keys of the returned hash. If a selector matches nothing, the key is present with a `nil` value rather than absent, so your upsert logic can rely on consistent structure.

structured.rb

ruby

res = OmniScrape.connection.post("/v1/scrape", {
  url:           "https://protected-shop.com/item/12",
  mode:          "auto",
  enable_solver: true,
  output_format: "css_extractor",
  css_selectors: {
    title:       "h1.product-name",
    price:       ".price-current",
    sku:         "[data-sku]",
    description: ".product-description p:first-child",
    image_url:   "img.product-hero@src",
  },
})

raise "Failed" unless res.body["success"]

fields = res.body.dig("data", "css_extracted")
# => { "title" => "...", "price" => "...", "sku" => "...", ... }

Product.upsert_from_api!(fields)
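
Because unmatched selectors come back as `nil`, it can help to normalize the hash before it reaches the database. The sketch below is one possible shape for that step: `normalize_fields` and `parse_price_cents` are hypothetical helpers, the sample values are invented, and the price parser assumes a `.` decimal separator with `,` as a thousands separator.

```ruby
# Hypothetical normalizer for the hash returned in data.css_extracted.
# Assumes string keys and that unmatched selectors come back as nil.
def normalize_fields(fields)
  {
    title:       fields["title"]&.strip,
    sku:         fields["sku"]&.strip,
    price_cents: parse_price_cents(fields["price"]),
  }
end

# Converts strings such as "$1,299.50" to integer cents.
# Returns nil when the input is nil or contains no number.
def parse_price_cents(raw)
  return nil if raw.nil?

  match = raw.delete(",").match(/(\d+)(?:\.(\d{1,2}))?/)
  return nil unless match

  whole    = match[1].to_i
  fraction = (match[2] || "0").ljust(2, "0").to_i
  whole * 100 + fraction
end

# Invented sample input: one field has stray whitespace, one is missing.
sample = { "title" => "  Desk Lamp ", "price" => "$1,299.50", "sku" => nil }
pp normalize_fields(sample)
```

Storing prices as integer cents avoids floating-point rounding; sites that format prices differently (for example `1.299,50`) would need a different parser.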

7. Paginating open catalogs

For sites without bot protection, loop page numbers directly with Faraday until you hit a 404 or an empty result set. Add a `sleep` between requests to avoid hammering the server — even on open sites, aggressive crawling can get your IP blocked or trigger rate limits that complicate later scraping.

For paginated protected sites, pass each page URL through `OmniScrape.fetch_html` instead of the direct Faraday connection. The rest of the loop logic stays identical. Enqueue pages as Sidekiq jobs rather than looping synchronously when the catalog is large.

paginate.rb

ruby

all  = []
page = 1

loop do
  url = "https://books.toscrape.com/catalogue/page-#{page}.html"
  res = conn.get(url)

  break if res.status == 404
  break unless res.success?

  doc   = Nokogiri::HTML.parse(res.body)
  cards = doc.css("article.product_pod")
  break if cards.empty?

  cards.each do |c|
    all << {
      title: c.at_css("h3 a")&.[]("title"),
      price: c.at_css(".price_color")&.text&.strip,
    }
  end

  puts "Page #{page}: #{all.size} total books collected"
  page += 1
  sleep 2
end

puts "Done — #{all.size} books"
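
One way to keep the loop identical for open and protected sites is to pass the fetch step in as a callable. The sketch below is a minimal version of that idea; `crawl_pages`, its keyword arguments, and the `example.test` URLs are hypothetical, and the fake site stands in for real HTTP so the code runs without a network.

```ruby
# Hypothetical helper: walks numbered pages until the fetcher returns nil
# (for example on a 404) or the parser finds no items.
#   fetch: callable taking a URL, returning an HTML string or nil
#   parse: callable taking an HTML string, returning an array of records
def crawl_pages(url_template:, fetch:, parse:, delay: 0, max_pages: 1_000)
  results = []
  (1..max_pages).each do |page|
    html = fetch.call(format(url_template, page: page))
    break if html.nil?

    items = parse.call(html)
    break if items.empty?

    results.concat(items)
    sleep(delay) if delay.positive?
  end
  results
end

# Stand-ins so the sketch runs offline: two pages, then nothing.
fake_site = {
  "https://example.test/page-1.html" => "a,b",
  "https://example.test/page-2.html" => "c",
}
fetch = ->(url) { fake_site[url] }
parse = ->(html) { html.split(",") }

collected = crawl_pages(
  url_template: "https://example.test/page-%<page>d.html",
  fetch: fetch,
  parse: parse
)
puts collected.inspect
```

In a real crawl, `fetch` would wrap either the direct connection (returning `res.body` only when `res.success?`) or `OmniScrape.fetch_html`, and `parse` would wrap the Nokogiri selectors. The `max_pages` cap is a guard against a site that never returns an empty page.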

8. JavaScript-rendered storefronts

Nokogiri parses the static HTML that Faraday fetches. If prices, inventory counts, or product listings are injected by React, Vue, or a similar framework after the initial page load, Faraday returns the shell HTML and Nokogiri finds nothing. Pass the URL to OmniScrape with `mode: "js_rendering"` to run a headless browser that executes JavaScript before returning the rendered HTML.

Use `js_wait_selector` to tell the browser to wait until a specific element is present in the DOM before capturing the page. This is more reliable than a fixed `js_wait_timeout` because it adapts to variable load times. See scraping JavaScript-rendered pages for a deeper breakdown.

spa.rb

ruby

# Basic JS rendering: waits for the default timeout
html = OmniScrape.fetch_html(
  "https://spa-store.com/catalog",
  mode: "js_rendering"
)

# More reliable: wait for a specific element to appear in the DOM
res = OmniScrape.connection.post("/v1/scrape", {
  url:              "https://spa-store.com/catalog",
  mode:             "js_rendering",
  output_format:    "html",
  js_wait_selector: "[data-testid='product-grid']",
  js_wait_timeout:  8000,
})

html = res.body.dig("data", "content")
doc  = Nokogiri::HTML.parse(html)
doc.css("[data-testid='product-card']").each { |c| puts c.at_css("h2")&.text }

9. Retries, error handling, and failure modes

Sidekiq retry is not free — each retry on a scraping job may consume API credits and delay other work. Configure explicit limits and handle each error class differently rather than letting Sidekiq retry everything blindly:

401 Unauthorized: raise a non-retryable error class; the key is wrong or missing. Fix ENV and redeploy, do not retry.
402 Payment Required: pause the scraping queue immediately and alert your billing contact. Retrying burns no credits but wastes queue capacity.
429 Too Many Requests: back off in the Faraday retry middleware before Sidekiq ever sees the failure. `faraday-retry` only retries idempotent methods by default, so POST calls to the API need `methods: %i[post]` in the retry options. Use `interval_randomness` to spread retries across workers.
502 Bad Gateway / 503 Service Unavailable: safe to retry with jitter; usually a transient upstream issue. Limit to 3–5 attempts.
`success: false` in response body: the request completed but the target returned unusable content (challenge page slipped through, empty body, etc.). Dead-letter the URL for manual review rather than retrying automatically.
Nokogiri returns nil for expected selectors: the page structure changed. Alert and pause the job class; retrying will produce the same empty records.
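
The policy above can be sketched as a small classifier. The error classes and method names here are hypothetical (nothing in this block comes from the OmniScrape API), and treating unlisted statuses as dead-letter candidates is an assumption of this sketch.

```ruby
# Hypothetical error classes; none of these come from the OmniScrape API.
module ScrapeErrors
  class Fatal      < StandardError; end # do not retry (bad key, no credits)
  class Transient  < StandardError; end # safe to retry with backoff
  class BadContent < StandardError; end # dead-letter for manual review
end

# Maps an HTTP status and the parsed `success` flag to an action.
def classify_response(status, success_flag)
  case status
  when 401, 402      then :fatal
  when 429, 502, 503 then :retry
  when 200..299      then success_flag ? :ok : :dead_letter
  else                    :dead_letter # assumption: unknown statuses get reviewed
  end
end

# Raises the matching error class, or returns nil when the response is usable.
def raise_for(status, success_flag)
  case classify_response(status, success_flag)
  when :fatal       then raise ScrapeErrors::Fatal, "HTTP #{status}"
  when :retry       then raise ScrapeErrors::Transient, "HTTP #{status}"
  when :dead_letter then raise ScrapeErrors::BadContent, "unusable response (HTTP #{status})"
  end
end

puts classify_response(503, nil)
puts classify_response(200, false)
```

A worker could then rescue `Fatal` and `BadContent`, record the URL, and return without re-raising so Sidekiq does not retry, while letting `Transient` propagate to Sidekiq's normal retry handling.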

Frequently asked questions

Should I use Faraday or HTTParty for scraping?

Faraday. Its middleware stack lets you add retry logic, JSON encoding, logging, and custom headers as composable layers without changing your core request code. HTTParty is fine for quick one-off scripts, but for production pipelines with Sidekiq the middleware model pays off quickly. Avoid open-uri for anything beyond local development: its timeouts have to be passed as options on every call, it has no retry or middleware layer, and it encourages unsafe patterns.

When should I use Nokogiri CSS selectors versus XPath?

Use CSS selectors for the majority of cases — product cards, price elements, links. They are shorter and easier to read. Switch to XPath when you need to select a node based on its text content (e.g., `//td[text()='Price']`), traverse to a parent or sibling, or express positional conditions that CSS cannot handle. Nokogiri supports both on the same document.

How many Sidekiq threads should I allocate to the scraping queue?

Start with a concurrency of 5–10 on a dedicated scraping queue. Monitor OmniScrape 429 responses and `billing.charged` per job before increasing. Scraping jobs are mostly I/O-bound, so higher concurrency is viable, but each thread holds a Faraday connection open and counts toward your API rate limits. Separate the scraping queue from your default queue so a burst of scrape jobs does not starve user-facing work.

Can I use Mechanize instead of Faraday and Nokogiri?

Mechanize wraps both HTTP and HTML parsing (it uses Nokogiri internally) and can follow links and submit forms, which is convenient for simple crawls. However, it has no middleware system for retries or custom request handling, and each agent carries mutable state such as cookies and history, so it takes more care to use safely inside concurrent Sidekiq workers. For production pipelines, use Faraday plus Nokogiri explicitly.

How do I handle sites that require cookies or session state?

Use OmniScrape's `session_id` field to persist a browser session across requests. Pass the same `session_id` string on each call to the API and the headless browser will reuse cookies and local storage from the previous request. For direct Faraday calls to open sites, use a cookie jar middleware such as the `faraday-cookie_jar` gem, or manage the `Set-Cookie` / `Cookie` headers manually.

What is the right way to store scraped data in Rails?

Use `upsert_all` with a unique constraint on a natural key (URL, SKU, or external ID) rather than `find_or_create_by` in a loop. This avoids N+1 database round-trips and handles concurrent workers writing the same record safely. Wrap large bulk inserts in a transaction and consider a staging table if you need to diff against existing records before committing.
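
A minimal sketch of that approach is below. `upsert_scraped_rows` is a hypothetical helper; it assumes `model` responds to `upsert_all` the way an ActiveRecord model does in Rails 6 or later, that the table has a unique index on `url`, and that a `scraped_at` column exists.

```ruby
# Hypothetical helper. `model` is expected to respond to `upsert_all` like an
# ActiveRecord model (Rails 6+); `unique_by: :url` assumes a unique index on
# the url column, and `scraped_at` is an assumed column name.
def upsert_scraped_rows(model, rows, batch_size: 500, now: Time.now)
  # Keep only the last record per URL: on PostgreSQL a single upsert
  # statement cannot update the same row twice.
  deduped = rows.reverse.uniq { |row| row[:url] }.reverse

  deduped.each_slice(batch_size) do |batch|
    stamped = batch.map { |row| row.merge(scraped_at: now) }
    model.upsert_all(stamped, unique_by: :url)
  end

  deduped.size
end

# Usage inside a Rails app (not run here):
#   upsert_scraped_rows(Product, rows)
```

Batching keeps each statement to a bounded size, and deduplicating first avoids a failure when two workers' results for the same URL end up in one batch.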

How do I avoid scraping the same URL twice across parallel Sidekiq workers?

Use a Redis set to track enqueued or completed URLs before pushing to Sidekiq. A simple `SADD` with the URL returns 0 if the URL was already present, so you can skip `perform_async`. Alternatively, use a database-level unique index on a `scraped_urls` table and rescue `ActiveRecord::RecordNotUnique` in the worker. For large crawls, a Bloom filter in Redis is more memory-efficient than a full set.
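
The Redis-set variant might look like the sketch below. It assumes a redis-rb client at version 4.8 or newer, where `sadd?` returns `true` only when the member was newly added; `enqueue_once` and the key name are hypothetical.

```ruby
# Hypothetical key name for the set of URLs already handed to Sidekiq.
SEEN_KEY = "scrape:seen_urls"

# Hypothetical dedup gate. `redis` is assumed to be a redis-rb client
# (4.8 or newer), whose `sadd?` returns true only for newly added members.
# Yields the URL once; returns false when it was already in the set.
def enqueue_once(redis, url)
  return false unless redis.sadd?(SEEN_KEY, url)

  yield url # e.g. ScrapeProductWorker.perform_async(url)
  true
end

# Usage with a real client (not run here):
#   redis = Redis.new
#   enqueue_once(redis, url) { |u| ScrapeProductWorker.perform_async(u) }
```

Because `SADD` is atomic, two processes racing on the same URL cannot both see it as new. The set grows without bound, so a long-running crawl would also want an expiry or periodic cleanup strategy.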
