OmniScrape

Web Scraping with Ruby: Faraday, Nokogiri, Sidekiq & OmniScrape

Ruby is a strong choice for production scraping pipelines. Nokogiri turns raw HTML into a queryable document tree — CSS selectors for most tasks, XPath when you need to traverse siblings or text nodes. Faraday gives you a composable HTTP client with middleware for retries, JSON encoding, and header management, without the TLS footguns that haunt old open-uri tutorials.

Rails teams push scrape work into Sidekiq so web requests stay fast and failures are isolated. When Faraday gets back a Cloudflare challenge page or a CAPTCHA wall instead of real HTML, hand the URL to the OmniScrape API and keep your Nokogiri selectors unchanged — the API returns clean HTML in `data.content`. Our Cloudflare bypass guide covers why challenge pages defeat direct HTTP clients. For the same patterns in another language, see web scraping with Python.

On this page

1. Gems you need
2. Fetch HTML with Faraday
3. Parse HTML with Nokogiri
4. Sidekiq for production-scale scraping
5. OmniScrape Faraday client module
6. Structured extraction without Nokogiri
7. Paginating open catalogs
8. JavaScript-rendered storefronts
9. Retries, error handling, and failure modes
10. FAQ

1. Gems you need

The core dependency pair is `faraday` for HTTP and `nokogiri` for HTML parsing. Add `faraday-retry` for middleware-based exponential backoff on transient errors like 502s — this keeps retry logic out of your worker code. `sidekiq` handles background job queuing and concurrency for production-scale crawls. Pin versions in your Gemfile and commit the lockfile so CI and production use identical native extensions, which matters for Nokogiri's C bindings.

terminal
bash
gem install faraday nokogiri faraday-retry

# Gemfile:
# gem "faraday",       "~> 2.9"
# gem "faraday-retry", "~> 2.2"
# gem "nokogiri",      "~> 1.16"
# gem "sidekiq",       "~> 7.3"   # background jobs

2. Fetch HTML with Faraday

Build a reusable connection object with explicit timeout values. Ruby's underlying `net/http` defaults to 60 seconds each for connecting and reading, so a single hung connection can block a Sidekiq thread for a minute or more. Set `open_timeout` (TCP connect) separately from `timeout` (read). With the configuration below, the `faraday-retry` middleware retries 429, 502, and 503 responses with exponential backoff before your code ever sees them.

Always check `response.success?` before passing the body to Nokogiri — a 403 response body is often a bot-detection page, not real content, and parsing it silently produces garbage records.

fetch.rb
ruby
require "faraday"
require "faraday/retry"

conn = Faraday.new do |f|
  f.request :retry, {
    max: 3,
    interval: 1.5,
    interval_randomness: 0.5,
    backoff_factor: 2,
    retry_statuses: [429, 502, 503],
  }
  f.options.timeout      = 30   # read timeout
  f.options.open_timeout = 10   # TCP connect timeout
  f.adapter Faraday.default_adapter
end

response = conn.get("https://books.toscrape.com/catalogue/page-1.html")
raise "HTTP #{response.status}" unless response.success?

html = response.body
puts "Fetched #{html.bytesize} bytes"
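
A status check alone does not catch a challenge page that arrives with a 200. The following is a minimal sketch of a body-level guard, not a complete detector: `blocked_response?` and the marker list are hypothetical names and values chosen for illustration, and you should replace the markers with strings you actually observe on your targets.

```ruby
# Hypothetical, illustrative markers. Extend with strings seen in practice.
CHALLENGE_MARKERS = [
  "Just a moment...",   # title commonly shown on Cloudflare interstitials
  "challenge-platform", # path fragment that appears in challenge scripts
].freeze

# Returns true when a response should NOT be handed to the HTML parser:
# either the status is outside 2xx or the body contains a known marker.
def blocked_response?(status, body)
  return true unless (200..299).cover?(status)

  text = body.to_s.downcase
  CHALLENGE_MARKERS.any? { |marker| text.include?(marker.downcase) }
end
```

In the fetch script above this would sit right after the request, for example `raise "Blocked" if blocked_response?(response.status, response.body)`.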

3. Parse HTML with Nokogiri

`Nokogiri::HTML.parse` builds a document from a raw HTML string, tolerating malformed markup gracefully. `css()` returns a `NodeSet` you can iterate; `at_css()` returns the first match or `nil`, so always guard against nil when a field may be absent. Call `.text.strip` to trim leading and trailing whitespace (it does not collapse whitespace inside the string). Attribute access uses the familiar hash syntax: `node['href']`.

For deeply nested or positional selections, XPath is more expressive than CSS. Use `xpath()` or `at_xpath()` when you need to select a node based on its text content or traverse to a sibling element.

parse.rb
ruby
require "nokogiri"

doc = Nokogiri::HTML.parse(html)
books = []

doc.css("article.product_pod").each do |card|
  books << {
    title:    card.at_css("h3 a")&.[]("title")&.strip,
    price:    card.at_css(".price_color")&.text&.strip,
    in_stock: card.at_css(".instock")&.text&.include?("In stock") || false,
    url:      card.at_css("h3 a")&.[]("href"),
  }
end

puts "Found #{books.size} books"
pp books.first(3)
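
The extracted `price` is still a display string and the `url` is relative to the catalogue page. A small normalization step, sketched here with hypothetical helper names (`price_to_minor_units`, `absolute_url`) and assuming a two-decimal currency, makes the records easier to store and compare:

```ruby
require "uri"

# Convert a price label such as "£51.77" into integer minor units (pence or
# cents). Assumes at most two decimal places; returns nil if no number found.
def price_to_minor_units(text)
  match = text.to_s.delete(",").match(/(\d+)(?:\.(\d{1,2}))?/)
  return nil unless match

  whole    = match[1].to_i
  fraction = (match[2] || "0").ljust(2, "0").to_i
  whole * 100 + fraction
end

# Resolve a possibly relative href against the URL of the page it came from.
def absolute_url(page_url, href)
  return nil if href.nil? || href.empty?

  URI.join(page_url, href).to_s
end

puts price_to_minor_units("£51.77") # => 5177
puts absolute_url("https://books.toscrape.com/catalogue/page-1.html",
                  "a-light-in-the-attic_1000/index.html")
# => https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Storing integers rather than floats avoids rounding drift when prices are later summed or diffed.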

4. Sidekiq for production-scale scraping

Never run scrape requests synchronously inside a Rails controller or ActiveRecord callback. Enqueue one Sidekiq job per URL or URL batch; workers call OmniScrape, parse the response, and write to the database. This keeps your web process response times fast and lets you tune scraping concurrency independently of your web worker count.

Set a conservative `retry` limit on the scraping queue. Sidekiq's default of 25 retries means a bad URL can consume credits for days. Use `sidekiq_options retry: 3` and a `sidekiq_retries_exhausted` block to dead-letter or alert on permanent failures. Separate the scraping queue from your default queue so a spike in scrape jobs does not delay user-facing background work.

scrape_product_worker.rb
ruby
# app/workers/scrape_product_worker.rb
class ScrapeProductWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3, queue: :scraping, dead: true

  sidekiq_retries_exhausted do |msg, ex|
    Bugsnag.notify(ex, { url: msg["args"].first })
    FailedScrapeUrl.create!(url: msg["args"].first, error: ex.message)
  end

  def perform(url)
    html = OmniScrape.fetch_html(url)
    doc  = Nokogiri::HTML.parse(html)
    Product.upsert_from_doc!(doc, url)
  end
end

# Enqueue a batch:
urls = ["https://shop.com/p/8821", "https://shop.com/p/9034"]
urls.each { |u| ScrapeProductWorker.perform_async(u) }
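
To see why the retry limit matters, it helps to work out how long a failing job stays alive. The sketch below assumes the default backoff described in Sidekiq's error-handling documentation, roughly `(retry_count ** 4) + 15` seconds per attempt, and ignores the random jitter Sidekiq adds on top; check the formula against the Sidekiq version you run. The helper names are hypothetical.

```ruby
# Approximate delay in seconds before each retry, ignoring random jitter.
# Assumes Sidekiq's documented default: (retry_count ** 4) + 15.
def retry_delays(max_retries)
  (0...max_retries).map { |count| (count**4) + 15 }
end

# Total time a permanently failing job spends in the retry set.
def total_retry_window(max_retries)
  retry_delays(max_retries).sum
end

short_window = total_retry_window(3)
long_window  = (total_retry_window(25) / 86_400.0).round(1)

puts short_window # => 62   (seconds, with retry: 3)
puts long_window  # => 20.4 (days, with the default of 25)
```

With `retry: 3` a bad URL is dead-lettered in about a minute; with the default it keeps being retried for roughly three weeks.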

5. OmniScrape Faraday client module

Centralize all OmniScrape API calls in a small module so workers stay thin. Use `faraday`'s `:json` request middleware to serialize the body and `:json` response middleware to parse the reply — no manual `JSON.parse`. The API key comes from an environment variable; never hard-code it or commit it to source control.

Log `metadata.method_used` and `billing.charged` on every call. This gives you an audit trail for cost attribution per job class and lets you spot when a target site starts requiring JavaScript rendering, which costs more than a fast HTTP fetch.

lib/omniscrape.rb
ruby
# lib/omniscrape.rb
module OmniScrape
  API_BASE = "https://api.omniscrape.io"

  def self.connection
    @connection ||= Faraday.new(url: API_BASE) do |f|
      f.request  :json
      f.response :json
      f.options.timeout      = 120
      f.options.open_timeout = 15
      f.headers["X-API-Key"] = ENV.fetch("OMNISCRAPE_KEY")
    end
  end

  # Returns clean HTML string from data.content
  def self.fetch_html(url, mode: "auto", **extra)
    payload = { url: url, mode: mode, output_format: "html" }.merge(extra)
    res     = connection.post("/v1/scrape", payload)
    body    = res.body

    raise "OmniScrape error (HTTP #{res.status}): #{body}" unless res.success?
    raise "Scrape failed for #{url}: #{body}" unless body["success"]

    Rails.logger.info(
      "[OmniScrape] #{url} | method=#{body.dig('metadata', 'method_used')} " \
      "solver=#{body.dig('metadata', 'solver_used')} " \
      "cost=$#{body.dig('billing', 'charged')} " \
      "balance=$#{body.dig('billing', 'balance_after')}"
    )

    body.dig("data", "content")
  end
end
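
Log lines are enough for an audit trail, but cost attribution per job class is easier with a running total. This is a minimal in-memory sketch: `ScrapeCostLedger` is a hypothetical class, the `"http"` value for `method_used` in the usage lines is an invented sample, and the ledger is not thread-safe, so a real Sidekiq deployment would more likely push these numbers to a metrics system.

```ruby
# Sums billing.charged per (job class, method_used) pair.
class ScrapeCostLedger
  def initialize
    @totals = Hash.new(0.0)
  end

  # body is the parsed API response hash; job_class is any label you choose.
  def record(job_class, body)
    method  = body.dig("metadata", "method_used") || "unknown"
    charged = body.dig("billing", "charged").to_f
    @totals[[job_class, method]] += charged
  end

  # Totals rounded to six decimals to hide float accumulation noise.
  def totals
    @totals.transform_values { |value| value.round(6) }
  end
end

ledger = ScrapeCostLedger.new
sample = { "metadata" => { "method_used" => "http" }, "billing" => { "charged" => 0.001 } }
2.times { ledger.record("ScrapeProductWorker", sample) }
p ledger.totals # => {["ScrapeProductWorker", "http"]=>0.002}
```

A sudden shift of spend from one `method_used` value to a more expensive one is the signal that a target has started requiring JavaScript rendering.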

6. Structured extraction without Nokogiri

When target fields are stable across pages, use `output_format: "css_extractor"` with a `css_selectors` map. OmniScrape applies the selectors server-side and returns a hash in `data.css_extracted` — you get structured data directly without parsing HTML in Ruby at all. This is ideal for Sidekiq workers that write straight to ActiveRecord: fewer moving parts, no selector logic to maintain in Ruby.

The selector map keys become the keys of the returned hash. If a selector matches nothing, the key is present with a `nil` value rather than absent, so your upsert logic can rely on consistent structure.

structured.rb
ruby
res = OmniScrape.connection.post("/v1/scrape", {
  url:           "https://protected-shop.com/item/12",
  mode:          "auto",
  enable_solver: true,
  output_format: "css_extractor",
  css_selectors: {
    title:       "h1.product-name",
    price:       ".price-current",
    sku:         "[data-sku]",
    description: ".product-description p:first-child",
    image_url:   "img.product-hero@src",
  },
})

raise "Failed" unless res.body["success"]

fields = res.body.dig("data", "css_extracted")
# => { "title" => "...", "price" => "...", "sku" => "...", ... }

Product.upsert_from_api!(fields)
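
Because unmatched selectors come back as `nil` rather than being omitted, it is cheap to validate the hash before writing it. A minimal sketch, assuming `title`, `price`, and `sku` are the fields your model cannot live without (`REQUIRED_FIELDS` and `missing_fields` are hypothetical names):

```ruby
REQUIRED_FIELDS = %w[title price sku].freeze

# Returns the names of required fields that are nil or blank.
# Accepts nil so a response without data.css_extracted is handled too.
def missing_fields(fields, required: REQUIRED_FIELDS)
  fields ||= {}
  required.select { |key| fields[key].to_s.strip.empty? }
end

fields  = { "title" => "Desk lamp", "price" => nil, "sku" => "  ", "description" => nil }
missing = missing_fields(fields)
p missing # => ["price", "sku"]
```

A worker could raise or dead-letter when `missing` is non-empty instead of upserting a half-empty record.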

7. Paginating open catalogs

For sites without bot protection, loop page numbers directly with Faraday until you hit a 404 or an empty result set. Add a `sleep` between requests to avoid hammering the server — even on open sites, aggressive crawling can get your IP blocked or trigger rate limits that complicate later scraping.

For paginated protected sites, pass each page URL through `OmniScrape.fetch_html` instead of the direct Faraday connection. The rest of the loop logic stays identical. Enqueue pages as Sidekiq jobs rather than looping synchronously when the catalog is large.

paginate.rb
ruby
all  = []
page = 1

loop do
  url = "https://books.toscrape.com/catalogue/page-#{page}.html"
  res = conn.get(url)

  break if res.status == 404
  break unless res.success?

  doc   = Nokogiri::HTML.parse(res.body)
  cards = doc.css("article.product_pod")
  break if cards.empty?

  cards.each do |c|
    all << {
      title: c.at_css("h3 a")&.[]("title"),
      price: c.at_css(".price_color")&.text&.strip,
    }
  end

  puts "Page #{page}: #{all.size} total books collected"
  page += 1
  sleep 2
end

puts "Done — #{all.size} books"
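
One way to make the loop reusable across open and protected sites is to inject the fetch step. The sketch below is an illustration under assumptions, not the guide's own implementation: `collect_pages` is a hypothetical helper, the `fetch_page` callable returns an array of records for a page number or `nil` when the page does not exist, and `max_pages` is an added safety cap so a site that never returns a 404 cannot loop forever.

```ruby
# Walks numbered pages until the fetcher signals the end or max_pages is hit.
# fetch_page: callable taking a page number, returning an Array of records,
#             or nil when the page does not exist (for example on a 404).
def collect_pages(fetch_page, max_pages: 500, delay: 0)
  all = []
  (1..max_pages).each do |page|
    records = fetch_page.call(page)
    break if records.nil? || records.empty?

    all.concat(records)
    sleep(delay) if delay.positive?
  end
  all
end

# Stand-in fetcher: three pages with one record each, then nothing.
fake_catalog = ->(page) { page <= 3 ? [{ page: page }] : nil }
puts collect_pages(fake_catalog).size # => 3
```

For an open site the callable would wrap the direct Faraday request and the Nokogiri selectors; for a protected one it would wrap `OmniScrape.fetch_html` with the same selectors.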

8. JavaScript-rendered storefronts

Nokogiri parses the static HTML that Faraday fetches. If prices, inventory counts, or product listings are injected by React, Vue, or a similar framework after the initial page load, Faraday returns the shell HTML and Nokogiri finds nothing. Pass the URL to OmniScrape with `mode: "js_rendering"` to run a headless browser that executes JavaScript before returning the rendered HTML.

Use `js_wait_selector` to tell the browser to wait until a specific element is present in the DOM before capturing the page. This is more reliable than a fixed `js_wait_timeout` because it adapts to variable load times. See scraping JavaScript-rendered pages for a deeper breakdown.

spa.rb
ruby
# Basic JS rendering: waits for the default timeout
html = OmniScrape.fetch_html(
  "https://spa-store.com/catalog",
  mode: "js_rendering"
)

# More reliable: wait for a specific element to appear in the DOM
res = OmniScrape.connection.post("/v1/scrape", {
  url:              "https://spa-store.com/catalog",
  mode:             "js_rendering",
  output_format:    "html",
  js_wait_selector: "[data-testid='product-grid']",
  js_wait_timeout:  8000,
})

html = res.body.dig("data", "content")
doc  = Nokogiri::HTML.parse(html)
doc.css("[data-testid='product-card']").each { |c| puts c.at_css("h2")&.text }
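
Since a rendered fetch costs more than a plain one, a common pattern is to try the cheap path first and escalate only when extraction comes back empty. The following sketch shows that control flow with injected callables; `fetch_with_escalation` is a hypothetical helper, and the stand-in lambdas use a regex instead of Nokogiri only to keep the example dependency-free.

```ruby
# Tries each fetch strategy in order and returns the first non-empty result.
# strategies: array of [label, callable] pairs; each callable maps URL -> HTML.
# extract:    callable mapping HTML -> Array of records.
def fetch_with_escalation(url, strategies:, extract:)
  strategies.each do |label, fetch|
    records = extract.call(fetch.call(url))
    return { method: label, records: records } unless records.empty?
  end
  { method: nil, records: [] }
end

# Stand-ins: the static fetch returns an empty app shell, the rendered one
# returns markup that contains a price attribute.
static   = ->(_url) { '<div id="root"></div>' }
rendered = ->(_url) { '<div data-price="9.99"></div>' }
extract  = ->(html) { html.scan(/data-price="([^"]+)"/).flatten }

result = fetch_with_escalation(
  "https://spa-store.com/catalog",
  strategies: [[:static, static], [:js_rendering, rendered]],
  extract: extract
)
p result # => {:method=>:js_rendering, :records=>["9.99"]} (Ruby 3.4+ prints {method: :js_rendering, records: ["9.99"]})
```

In a real worker the first callable would be the direct Faraday request and the second `OmniScrape.fetch_html(url, mode: "js_rendering")`.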

9. Retries, error handling, and failure modes

Sidekiq retry is not free — each retry on a scraping job may consume API credits and delay other work. Configure explicit limits and handle each error class differently rather than letting Sidekiq retry everything blindly:

  • 401 Unauthorized — raise a non-retryable error class; the key is wrong or missing. Fix ENV and redeploy, do not retry.
  • 402 Payment Required — pause the scraping queue immediately and alert your billing contact. Retrying burns no credits but wastes queue capacity.
  • 429 Too Many Requests — back off in the Faraday retry middleware before Sidekiq ever sees the failure. Use `interval_randomness` to spread retries across workers.
  • 502 Bad Gateway / 503 Service Unavailable: safe to retry with jitter; usually a transient upstream issue. Limit to 3–5 attempts.
  • success: false in response body — the request completed but the target returned unusable content (challenge page slipped through, empty body, etc.). Dead-letter the URL for manual review rather than retrying automatically.
  • Nokogiri returns nil for expected selectors — the page structure changed. Alert and pause the job class; retrying will produce the same empty records.
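
The policy in the list above can be sketched as a small classifier that maps a response to an exception class. The class names and `classify_failure` are hypothetical, and treating any status not named in the list as non-retryable is an assumption made here to err on the side of not spending credits.

```ruby
class NonRetryableScrapeError < StandardError; end
class QuotaExhaustedError < NonRetryableScrapeError; end
class RetryableScrapeError < StandardError; end

# Maps an API response to an exception class, or nil when the call succeeded.
# status: Integer HTTP status; body: parsed response (Hash) or raw String.
def classify_failure(status, body)
  case status
  when 401 then NonRetryableScrapeError # bad or missing key
  when 402 then QuotaExhaustedError     # pause the queue and alert
  when 429, 502, 503 then RetryableScrapeError
  when 200..299
    # Completed request, but the API may still report unusable content.
    body.is_a?(Hash) && body["success"] ? nil : NonRetryableScrapeError
  else
    NonRetryableScrapeError # assumption: unknown statuses are not retried
  end
end
```

A worker could rescue `NonRetryableScrapeError`, record the URL for review, and return normally so Sidekiq does not schedule a retry, while letting `RetryableScrapeError` propagate.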

Frequently asked questions

Should I use Faraday or HTTParty for scraping?

Faraday. Its middleware stack lets you add retry logic, JSON encoding, logging, and custom headers as composable layers without changing your core request code. HTTParty is fine for quick one-off scripts, but for production pipelines with Sidekiq the middleware model pays off quickly. Avoid open-uri for anything beyond local development: it offers timeout options but no retry or middleware layer, and older tutorials built on it encourage unsafe patterns such as disabling certificate verification.

When should I use Nokogiri CSS selectors versus XPath?

Use CSS selectors for the majority of cases — product cards, price elements, links. They are shorter and easier to read. Switch to XPath when you need to select a node based on its text content (e.g., `//td[text()='Price']`), traverse to a parent or sibling, or express positional conditions that CSS cannot handle. Nokogiri supports both on the same document.

How many Sidekiq threads should I allocate to the scraping queue?

Start with a concurrency of 5–10 on a dedicated scraping queue. Monitor OmniScrape 429 responses and `billing.charged` per job before increasing. Scraping jobs are mostly I/O-bound, so higher concurrency is viable, but each thread holds a Faraday connection open and counts toward your API rate limits. Separate the scraping queue from your default queue so a burst of scrape jobs does not starve user-facing work.

Can I use Mechanize instead of Faraday and Nokogiri?

Mechanize wraps both HTTP and HTML parsing and can follow links and submit forms, which is convenient for simple crawls. However, it does not handle modern TLS configurations well, has no middleware system for retries or logging, and cannot integrate cleanly with Sidekiq workers. For anything beyond internal tools on controlled HTTP servers, use Faraday plus Nokogiri explicitly.

How do I handle sites that require cookies or session state?

Use OmniScrape's `session_id` field to persist a browser session across requests. Pass the same `session_id` string on each call to the API and the headless browser will reuse cookies and local storage from the previous request. For direct Faraday calls to open sites, use the `Faraday::CookieJar` middleware, which ships in the separate `faraday-cookie_jar` gem rather than in Faraday itself, or manage the `Set-Cookie` / `Cookie` headers manually.

What is the right way to store scraped data in Rails?

Use `upsert_all` with a unique constraint on a natural key (URL, SKU, or external ID) rather than `find_or_create_by` in a loop. This avoids N+1 database round-trips and handles concurrent workers writing the same record safely. Wrap large bulk inserts in a transaction and consider a staging table if you need to diff against existing records before committing.

How do I avoid scraping the same URL twice across parallel Sidekiq workers?

Use a Redis set to track enqueued or completed URLs before pushing to Sidekiq. A simple `SADD` with the URL returns 0 if the URL was already present, so you can skip `perform_async`. Alternatively, use a database-level unique index on a `scraped_urls` table and rescue `ActiveRecord::RecordNotUnique` in the worker. For large crawls, a Bloom filter in Redis is more memory-efficient than a full set.
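
The check-before-enqueue idea can be sketched in plain Ruby. An in-process `Set` stands in for the Redis set here, so this version only dedupes within one process; across workers you would swap `Set#add?` for a Redis `SADD`. `dedupe_key` and `SeenUrls` are hypothetical names, and the normalization rules (lowercase host, drop fragment) are assumptions about what counts as the same page.

```ruby
require "set"
require "uri"

# Normalizes a URL so trivially different spellings map to one key.
def dedupe_key(url)
  uri = URI.parse(url.strip)
  uri.fragment = nil
  uri.host = uri.host.downcase if uri.host
  uri.path = "/" if uri.path.empty?
  uri.to_s
end

class SeenUrls
  def initialize
    @seen = Set.new
  end

  # True the first time a URL is offered, false afterwards. This mirrors
  # SADD returning 1 for a new member and 0 for an existing one.
  def first_time?(url)
    !@seen.add?(dedupe_key(url)).nil?
  end
end

seen = SeenUrls.new
puts seen.first_time?("https://shop.com/p/8821")         # => true
puts seen.first_time?("https://Shop.com/p/8821#reviews") # => false
```

Only URLs for which `first_time?` returns true would be passed to `perform_async`.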

Related guides

  • Web Scraping with Python
  • How to Bypass Cloudflare When Web Scraping
  • Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs
  • Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

© 2026 OmniScrape. All rights reserved.
