1. Gems you need
The core dependency pair is `faraday` for HTTP and `nokogiri` for HTML parsing. Add `faraday-retry` for middleware-based exponential backoff on transient errors like 502s — this keeps retry logic out of your worker code. `sidekiq` handles background job queuing and concurrency for production-scale crawls. Pin versions in your Gemfile and commit the lockfile so CI and production use identical native extensions, which matters for Nokogiri's C bindings.
gem install faraday nokogiri faraday-retry
# Gemfile:
# gem "faraday", "~> 2.9"
# gem "faraday-retry", "~> 2.2"
# gem "nokogiri", "~> 1.16"
# gem "sidekiq", "~> 7.3" # background jobs
2. Fetch HTML with Faraday
Build a reusable connection object with explicit timeout values. Ruby's underlying `net/http` defaults (60 seconds each for connect and read) are generous enough that a single hung connection can block a Sidekiq thread for minutes. Set `open_timeout` (TCP connect) separately from `timeout` (read). With the configuration below, the `faraday-retry` middleware retries 429, 502, and 503 responses with exponential backoff before your code ever sees them.
Always check `response.success?` before passing the body to Nokogiri — a 403 response body is often a bot-detection page, not real content, and parsing it silently produces garbage records.
require "faraday"
require "faraday/retry"

conn = Faraday.new do |f|
  f.request :retry, {
    max: 3,
    interval: 1.5,
    interval_randomness: 0.5,
    backoff_factor: 2,
    retry_statuses: [429, 502, 503],
  }
  f.options.timeout = 30 # read timeout
  f.options.open_timeout = 10 # TCP connect timeout
  f.adapter Faraday.default_adapter
end

response = conn.get("https://books.toscrape.com/catalogue/page-1.html")
raise "HTTP #{response.status}" unless response.success?

html = response.body
puts "Fetched #{html.bytesize} bytes"
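To see why explicit timeouts matter, the sketch below reads the defaults from Ruby's standard `net/http` client, which Faraday's default adapter builds on. It opens no connection; the host is simply the example site used above. The worst-case figure is a rough estimate that assumes one connect phase and one stalled read per attempt.

```ruby
require "net/http"

# Net::HTTP.new only builds the client object; nothing is sent over the network.
http = Net::HTTP.new("books.toscrape.com", 443)

open_timeout = http.open_timeout    # seconds allowed for the TCP connect
read_timeout = http.read_timeout    # seconds allowed for each blocking read
attempts     = http.max_retries + 1 # Net::HTTP may retry idempotent requests

# Rough upper bound for one hung GET when no timeouts are configured.
worst_case = (open_timeout + read_timeout) * attempts

puts "open_timeout=#{open_timeout}s read_timeout=#{read_timeout}s attempts=#{attempts}"
puts "a hung GET can hold a thread for roughly #{worst_case}s"
```

With the 10-second connect and 30-second read limits from the connection above, the same arithmetic gives a much smaller bound per attempt, which is the point of setting them explicitly.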
3. Parse HTML with Nokogiri
`Nokogiri::HTML.parse` builds a document from a raw HTML string and tolerates malformed markup. `css()` returns a `NodeSet` you can iterate; `at_css()` returns the first match or `nil`, so guard against `nil` whenever a field may be absent. Call `.text.strip` to trim leading and trailing whitespace. Attribute access uses the familiar hash syntax: `node['href']`.
For deeply nested or positional selections, XPath is more expressive than CSS. Use `xpath()` or `at_xpath()` when you need to select a node based on its text content or traverse to a sibling element.
require "nokogiri"

doc = Nokogiri::HTML.parse(html)
books = []

doc.css("article.product_pod").each do |card|
  books << {
    title: card.at_css("h3 a")&.[]("title")&.strip,
    price: card.at_css(".price_color")&.text&.strip,
    in_stock: card.at_css(".instock")&.text&.include?("In stock") || false,
    url: card.at_css("h3 a")&.[]("href"),
  }
end

puts "Found #{books.size} books"
pp books.first(3)
4. Sidekiq for production-scale scraping
Never run scrape requests synchronously inside a Rails controller or ActiveRecord callback. Enqueue one Sidekiq job per URL or URL batch; workers call OmniScrape, parse the response, and write to the database. This keeps your web process response times fast and lets you tune scraping concurrency independently of your web worker count.
Set a conservative `retry` limit on the scraping queue. Sidekiq's default of 25 retries, spread over roughly three weeks of exponential backoff, means a bad URL can keep consuming credits for days. Use `sidekiq_options retry: 3` and a `sidekiq_retries_exhausted` block to dead-letter or alert on permanent failures. Separate the scraping queue from your default queue so a spike in scrape jobs does not delay user-facing background work.
# app/workers/scrape_product_worker.rb
class ScrapeProductWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3, queue: :scraping, dead: true

  sidekiq_retries_exhausted do |msg, ex|
    Bugsnag.notify(ex, { url: msg["args"].first })
    FailedScrapeUrl.create!(url: msg["args"].first, error: ex.message)
  end

  def perform(url)
    html = OmniScrape.fetch_html(url)
    doc = Nokogiri::HTML.parse(html)
    Product.upsert_from_doc!(doc, url)
  end
end

# Enqueue a batch:
urls = ["https://shop.com/p/8821", "https://shop.com/p/9034"]
urls.each { |u| ScrapeProductWorker.perform_async(u) }
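To put the retry limit in perspective, here is a back-of-the-envelope calculation of how long a failing job keeps coming back. It assumes the backoff formula given in Sidekiq's documentation, roughly `(retry_count ** 4) + 15` seconds plus random jitter. The jitter is ignored here, so the totals are lower bounds, and the exact formula may differ between Sidekiq versions.

```ruby
# Approximate delay before retry number `count` (0-based), ignoring jitter.
# Assumption: Sidekiq's documented default of (count ** 4) + 15 seconds.
def retry_delay_seconds(count)
  (count**4) + 15
end

# Total time spent waiting across `max_retries` retries.
def total_retry_window(max_retries)
  (0...max_retries).sum { |count| retry_delay_seconds(count) }
end

default_window = total_retry_window(25)
capped_window  = total_retry_window(3)

puts format("retry: 25 -> %.1f days", default_window / 86_400.0)
puts format("retry: 3  -> %d seconds", capped_window)
```

Under these assumptions the default schedule spans about 20 days, while `retry: 3` gives up after about a minute of waiting, so a permanently broken URL reaches the `sidekiq_retries_exhausted` block almost immediately.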
5. OmniScrape Faraday client module
Centralize all OmniScrape API calls in a small module so workers stay thin. Use `faraday`'s `:json` request middleware to serialize the body and `:json` response middleware to parse the reply — no manual `JSON.parse`. The API key comes from an environment variable; never hard-code it or commit it to source control.
Log `metadata.method_used` and `billing.charged` on every call. This gives you an audit trail for cost attribution per job class and lets you spot when a target site starts requiring JavaScript rendering, which costs more than a fast HTTP fetch.
# lib/omniscrape.rb
module OmniScrape
  API_BASE = "https://api.omniscrape.io"

  def self.connection
    @connection ||= Faraday.new(url: API_BASE) do |f|
      f.request :json
      f.response :json
      f.options.timeout = 120
      f.options.open_timeout = 15
      f.headers["X-API-Key"] = ENV.fetch("OMNISCRAPE_KEY")
    end
  end

  # Returns clean HTML string from data.content
  def self.fetch_html(url, mode: "auto", **extra)
    payload = { url: url, mode: mode, output_format: "html" }.merge(extra)
    res = connection.post("/v1/scrape", payload)
    body = res.body

    raise "OmniScrape error (HTTP #{res.status}): #{body}" unless res.success?
    raise "Scrape failed for #{url}: #{body}" unless body["success"]

    Rails.logger.info(
      "[OmniScrape] #{url} | method=#{body.dig('metadata', 'method_used')} " \
      "solver=#{body.dig('metadata', 'solver_used')} " \
      "cost=$#{body.dig('billing', 'charged')} " \
      "balance=$#{body.dig('billing', 'balance_after')}"
    )

    body.dig("data", "content")
  end
end
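One way to turn those logged fields into cost attribution is to tally them per job class. The sketch below is a hypothetical in-memory helper: `ScrapeCostTally` is not part of OmniScrape or Sidekiq, and the `method_used` values in the sample data are invented for illustration. It assumes `billing.charged` arrives as a number or a numeric string, and it sums with `Rational` to avoid floating-point drift on small amounts.

```ruby
# Hypothetical helper: accumulates spend and fetch methods per job class.
class ScrapeCostTally
  def initialize
    @totals  = Hash.new(Rational(0))
    @methods = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  end

  # `body` is the parsed JSON response hash returned by the API.
  def record(job_class, body)
    charged     = body.dig("billing", "charged").to_s.to_r
    method_used = body.dig("metadata", "method_used") || "unknown"
    @totals[job_class] += charged
    @methods[job_class][method_used] += 1
  end

  def total_for(job_class)
    @totals[job_class]
  end

  def methods_for(job_class)
    @methods[job_class].dup
  end
end

tally = ScrapeCostTally.new
# Sample responses; the amounts and method names are made up.
tally.record("ScrapeProductWorker",
             { "billing" => { "charged" => 0.001 }, "metadata" => { "method_used" => "http" } })
tally.record("ScrapeProductWorker",
             { "billing" => { "charged" => "0.004" }, "metadata" => { "method_used" => "js_rendering" } })

puts format("total spend: $%.3f", tally.total_for("ScrapeProductWorker").to_f)
puts tally.methods_for("ScrapeProductWorker").inspect
```

A rising share of the more expensive method for one job class is the signal that a target site has started to require JavaScript rendering.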
6. Structured extraction without Nokogiri
When target fields are stable across pages, use `output_format: "css_extractor"` with a `css_selectors` map. OmniScrape applies the selectors server-side and returns a hash in `data.css_extracted` — you get structured data directly without parsing HTML in Ruby at all. This is ideal for Sidekiq workers that write straight to ActiveRecord: fewer moving parts, no selector logic to maintain in Ruby.
The selector map keys become the keys of the returned hash. If a selector matches nothing, the key is present with a `nil` value rather than absent, so your upsert logic can rely on consistent structure.
res = OmniScrape.connection.post("/v1/scrape", {
  url: "https://protected-shop.com/item/12",
  mode: "auto",
  enable_solver: true,
  output_format: "css_extractor",
  css_selectors: {
    title: "h1.product-name",
    price: ".price-current",
    sku: "[data-sku]",
    description: ".product-description p:first-child",
    image_url: "img.product-hero@src",
  },
})

raise "Failed" unless res.body["success"]

fields = res.body.dig("data", "css_extracted")
# => { "title" => "...", "price" => "...", "sku" => "...", ... }

Product.upsert_from_api!(fields)
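Because unmatched selectors come back as `nil` rather than as absent keys, a worker can check the hash before writing it. The sketch below shows one way to do that. `REQUIRED_FIELDS`, `MissingFieldsError`, and `validate_extracted!` are hypothetical names, and which fields count as required is an assumption about your own schema.

```ruby
# Hypothetical guard to run before Product.upsert_from_api!(fields).
REQUIRED_FIELDS = %w[title price sku].freeze

class MissingFieldsError < StandardError; end

# Returns the hash with string values stripped. Raises when a required
# field is nil or blank, so the job fails loudly instead of storing gaps.
def validate_extracted!(fields)
  cleaned = fields.transform_values { |value| value.is_a?(String) ? value.strip : value }
  missing = REQUIRED_FIELDS.select { |key| cleaned[key].nil? || cleaned[key].to_s.empty? }
  raise MissingFieldsError, "missing: #{missing.join(', ')}" unless missing.empty?

  cleaned
end

sample = { "title" => " Widget ", "price" => "$9.99", "sku" => "W-1", "description" => nil }
checked = validate_extracted!(sample)
puts checked["title"]
```

Optional fields such as `description` pass through as `nil`, which keeps the consistent structure the API provides while still catching a selector that has stopped matching.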
7. Paginating open catalogs
For sites without bot protection, loop page numbers directly with Faraday until you hit a 404 or an empty result set. Add a `sleep` between requests to avoid hammering the server — even on open sites, aggressive crawling can get your IP blocked or trigger rate limits that complicate later scraping.
For paginated protected sites, pass each page URL through `OmniScrape.fetch_html` instead of the direct Faraday connection. The rest of the loop logic stays identical. Enqueue pages as Sidekiq jobs rather than looping synchronously when the catalog is large.
all = []
page = 1

loop do
  url = "https://books.toscrape.com/catalogue/page-#{page}.html"
  res = conn.get(url)

  break if res.status == 404
  break unless res.success?

  doc = Nokogiri::HTML.parse(res.body)
  cards = doc.css("article.product_pod")
  break if cards.empty?

  cards.each do |c|
    all << {
      title: c.at_css("h3 a")&.[]("title"),
      price: c.at_css(".price_color")&.text&.strip,
    }
  end

  puts "Page #{page}: #{all.size} total books collected"
  page += 1
  sleep 2
end

puts "Done: #{all.size} books"
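Since the loop is the same for open and protected sites, the fetch step can be passed in as a callable. The sketch below is a hypothetical refactoring of the loop above, not an API from Faraday or OmniScrape. `crawl_pages`, `fetch_page`, and `url_for` are illustrative names, and `fetch_page` is assumed to return an array of records, or `nil` when the page does not exist. The example runs against an in-memory fake site so it needs no network.

```ruby
# Walks numbered pages until a page is missing or empty.
# fetch_page: callable(url) -> Array of records, or nil for a missing page
# url_for:    callable(page_number) -> URL string
def crawl_pages(fetch_page:, url_for:, max_pages: 100, delay: 2)
  all = []
  (1..max_pages).each do |page|
    records = fetch_page.call(url_for.call(page))
    break if records.nil? || records.empty?

    all.concat(records)
    sleep delay if delay.positive? # be polite between requests
  end
  all
end

# Stand-in for a real site: two pages of results, then nothing.
fake_site = {
  "https://example.test/page-1.html" => [{ title: "A" }, { title: "B" }],
  "https://example.test/page-2.html" => [{ title: "C" }],
}

books = crawl_pages(
  fetch_page: ->(url) { fake_site[url] },
  url_for: ->(page) { "https://example.test/page-#{page}.html" },
  delay: 0
)
puts "collected #{books.size} records"
```

For a protected site, the `fetch_page` lambda would wrap `OmniScrape.fetch_html` plus the Nokogiri parsing, and the loop itself would not change. The `max_pages` cap is an added safety net against a site that never returns an empty page.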
8. JavaScript-rendered storefronts
Nokogiri parses the static HTML that Faraday fetches. If prices, inventory counts, or product listings are injected by React, Vue, or a similar framework after the initial page load, Faraday returns the shell HTML and Nokogiri finds nothing. Pass the URL to OmniScrape with `mode: "js_rendering"` to run a headless browser that executes JavaScript before returning the rendered HTML.
Use `js_wait_selector` to tell the browser to wait until a specific element is present in the DOM before capturing the page. This is more reliable than a fixed `js_wait_timeout` because it adapts to variable load times. See the guide on scraping JavaScript-rendered pages for a deeper breakdown.
# Basic JS rendering: waits for the default timeout
html = OmniScrape.fetch_html(
  "https://spa-store.com/catalog",
  mode: "js_rendering"
)

# More reliable: wait for a specific element to appear in the DOM
res = OmniScrape.connection.post("/v1/scrape", {
  url: "https://spa-store.com/catalog",
  mode: "js_rendering",
  output_format: "html",
  js_wait_selector: "[data-testid='product-grid']",
  js_wait_timeout: 8000,
})

html = res.body.dig("data", "content")
doc = Nokogiri::HTML.parse(html)
doc.css("[data-testid='product-card']").each { |c| puts c.at_css("h2")&.text }
9. Retries, error handling, and failure modes
Sidekiq retry is not free — each retry on a scraping job may consume API credits and delay other work. Configure explicit limits and handle each error class differently rather than letting Sidekiq retry everything blindly:
- 401 Unauthorized: raise a non-retryable error class; the key is wrong or missing. Fix ENV and redeploy; do not retry.
- 402 Payment Required: pause the scraping queue immediately and alert your billing contact. Retrying burns no credits but wastes queue capacity.
- 429 Too Many Requests: back off in the Faraday retry middleware before Sidekiq ever sees the failure. Use `interval_randomness` to spread retries across workers.
- 502 Bad Gateway / 503 Service Unavailable: safe to retry with jitter; usually a transient upstream issue. Limit to 3–5 attempts.
- `success: false` in the response body: the request completed but the target returned unusable content (a challenge page slipped through, an empty body, etc.). Dead-letter the URL for manual review rather than retrying automatically.
- Nokogiri returns `nil` for expected selectors: the page structure changed. Alert and pause the job class; retrying will produce the same empty records.
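The policy in this list can be written down as a small lookup so that each worker does not reinvent it. This is a sketch under the assumptions above: `classify_scrape_result` and the action symbols are hypothetical names, `body` is the parsed JSON hash returned by the API, and routing any other status to the dead-letter path is an added assumption rather than something the list specifies.

```ruby
# Maps an API response to the action recommended in the list above.
def classify_scrape_result(status, body = nil)
  case status
  when 401 then :fail_without_retry # bad or missing key: fix ENV, redeploy
  when 402 then :pause_queue        # out of credit: stop and alert billing
  when 429, 502, 503 then :retry_with_backoff
  when 200..299
    # Request completed; trust it only if the API reports success.
    body.is_a?(Hash) && body["success"] ? :ok : :dead_letter
  else
    :dead_letter                    # assumption: park anything unexpected
  end
end

[[401, nil], [429, nil], [200, { "success" => false }], [200, { "success" => true }]].each do |status, body|
  puts "#{status} -> #{classify_scrape_result(status, body)}"
end
```

A worker can then branch on the returned symbol, for example raising a custom non-retryable error for `:fail_without_retry` and a plain error for `:retry_with_backoff` so that only the latter uses Sidekiq's retry budget.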
Frequently asked questions
Should I use Faraday or HTTParty for scraping?
Faraday. Its middleware stack lets you add retry logic, JSON encoding, logging, and custom headers as composable layers without changing your core request code. HTTParty is fine for quick one-off scripts, but for production pipelines with Sidekiq the middleware model pays off quickly. Avoid open-uri for anything beyond local development: it has no middleware or retry layer, its timeouts must be passed as options on every call, and it encourages unsafe patterns.
When should I use Nokogiri CSS selectors versus XPath?
Use CSS selectors for the majority of cases — product cards, price elements, links. They are shorter and easier to read. Switch to XPath when you need to select a node based on its text content (e.g., `//td[text()='Price']`), traverse to a parent or sibling, or express positional conditions that CSS cannot handle. Nokogiri supports both on the same document.
How many Sidekiq threads should I allocate to the scraping queue?
Start with a concurrency of 5–10 on a dedicated scraping queue. Monitor OmniScrape 429 responses and `billing.charged` per job before increasing. Scraping jobs are mostly I/O-bound, so higher concurrency is viable, but each thread holds a Faraday connection open and counts toward your API rate limits. Separate the scraping queue from your default queue so a burst of scrape jobs does not starve user-facing work.
Can I use Mechanize instead of Faraday and Nokogiri?
Mechanize wraps both HTTP and HTML parsing and can follow links and submit forms, which is convenient for simple crawls. However, it does not handle modern TLS configurations well, has no middleware system for retries or logging, and cannot integrate cleanly with Sidekiq workers. For anything beyond internal tools on controlled HTTP servers, use Faraday plus Nokogiri explicitly.
How do I handle sites that require cookies or session state?
Use OmniScrape's `session_id` field to persist a browser session across requests. Pass the same `session_id` string on each call to the API and the headless browser will reuse cookies and local storage from the previous request. For direct Faraday calls to open sites, use the `Faraday::CookieJar` middleware from the `faraday-cookie_jar` gem or manage the `Set-Cookie` / `Cookie` headers manually.
What is the right way to store scraped data in Rails?
Use `upsert_all` with a unique constraint on a natural key (URL, SKU, or external ID) rather than `find_or_create_by` in a loop. This avoids N+1 database round-trips and handles concurrent workers writing the same record safely. Wrap large bulk inserts in a transaction and consider a staging table if you need to diff against existing records before committing.
How do I avoid scraping the same URL twice across parallel Sidekiq workers?
Use a Redis set to track enqueued or completed URLs before pushing to Sidekiq. A simple `SADD` with the URL returns 0 if the URL was already present, so you can skip `perform_async`. Alternatively, use a database-level unique index on a `scraped_urls` table and rescue `ActiveRecord::RecordNotUnique` in the worker. For large crawls, a Bloom filter in Redis is more memory-efficient than a full set.
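The claim step can be sketched without a Redis server by using Ruby's `Set#add?`, which returns `nil` when the element is already present, much as `SADD` returns 0. `UrlClaims` is a hypothetical stand-in that is only safe within a single process; a real crawl across several Sidekiq processes would replace the set with the Redis call. Stripping trailing slashes is an assumption about how your target site treats URLs.

```ruby
require "set"

# In-memory stand-in for a Redis set of already-claimed URLs.
class UrlClaims
  def initialize
    @seen  = Set.new
    @mutex = Mutex.new
  end

  # Returns true only for the first caller to claim a given URL.
  def claim?(url)
    @mutex.synchronize { !@seen.add?(normalize(url)).nil? }
  end

  private

  # Light normalization so trivial variants do not slip through.
  def normalize(url)
    url.strip.sub(%r{/+\z}, "")
  end
end

claims = UrlClaims.new
urls = ["https://shop.com/p/8821", "https://shop.com/p/9034", "https://shop.com/p/8821/"]

to_enqueue = urls.select { |url| claims.claim?(url) }
# In a worker setup: to_enqueue.each { |u| ScrapeProductWorker.perform_async(u) }
puts "#{to_enqueue.size} of #{urls.size} URLs are new"
```

Checking and adding in one operation matters: a separate "is it there?" followed by "add it" leaves a window in which two workers both see the URL as new.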