Web Scraping with Elixir

Elixir's strength is orchestration: millions of lightweight BEAM processes, isolated failure domains, and backpressure primitives that keep the cluster alive when one upstream domain goes dark. You are not choosing Elixir because it parses HTML faster than Python — you are choosing it because a single slow SPA should restart one supervised worker, not cascade across fifty healthy ones.

The practical stack is straightforward: Req handles HTTP with sane defaults and built-in JSON encoding; Floki parses HTML using CSS selectors without spawning a browser; OTP supervision wraps workers so crashes are contained and retried automatically. For bot-protected or JavaScript-heavy pages, workers POST to the OmniScrape API and feed the returned HTML straight into Floki — the network layer is replaced, the parsing layer stays identical. If your team also prototypes in Python, web scraping with Python covers the equivalent patterns side by side.

This guide walks through every layer: Mix dependencies, Req fetch patterns, Floki extraction, OmniScrape integration, OTP supervision trees, async_stream concurrency, Broadway ingest pipelines, js_rendering for rendered DOM, and production-grade error budgeting.

On this page

1. Mix dependencies
2. Fetch with Req
3. Parse with Floki
4. OmniScrape via Req
5. OTP supervision for scrape workers
6. Concurrent URLs with async_stream
7. Broadway for ingest pipelines
8. js_rendering for rendered DOM
9. Let it crash — with budget caps
10. FAQ

1. Mix dependencies

Three libraries cover the full scraping stack. Req is the modern HTTP client: actively maintained, composable middleware, built-in retry and JSON handling. Floki is an HTML parser with a CSS-selector query API; it parses with mochiweb_html by default and can be configured to use html5ever or fast_html instead, accepts tag-soup, and returns a traversable tree. Jason is the standard JSON codec; Phoenix projects already have it, and plain Mix apps should pin it explicitly.

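Run `mix deps.get` after editing mix.exs. The `~>` constraints set a floor and stop before the next major version, so `~> 0.5` accepts any 0.x release from 0.5.0 up; write `~> 0.5.0` if you want to stay within one minor series, since pre-1.0 libraries may break between minors. Commit `mix.lock` and treat it as the source of truth so CI does not silently pull a different version.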

mix.exs
elixir
# mix.exs
defp deps do
  [
    {:req, "~> 0.5"},
    {:floki, "~> 0.36"},
    {:jason, "~> 1.4"}
  ]
end
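
The script examples in this guide (`fetch.exs`, `parse.exs`, and so on) can also run outside a Mix project. A minimal sketch, assuming Elixir 1.12 or later (where `Mix.install/2` is available) and the same version constraints as above:

```elixir
# Top of a standalone .exs script. Dependencies are fetched and compiled on
# the first run and cached afterwards; this needs network access once.
Mix.install([
  {:req, "~> 0.5"},
  {:floki, "~> 0.36"},
  {:jason, "~> 1.4"}
])
```

With that header in place, a script runs with `elixir fetch.exs`. Long-lived workers still belong in a Mix project with a committed lock file.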

2. Fetch with Req

Req.get!/2 raises on network failure, which is acceptable in one-off scripts and dangerous in supervised workers. Production code uses Req.request/1 or Req.get/2 and pattern-matches on `{:ok, resp}` versus `{:error, reason}`. The distinction matters: a crash inside a Task.Supervisor child is contained to that task (tasks are `:temporary` by default, so the caller decides whether to run it again), while an unhandled exception inside a GenServer callback terminates the server and exits every caller blocked in GenServer.call/3.

Set `receive_timeout` explicitly. Req's default is 15 seconds, which suits fast servers but may cut off slow endpoints or hold a worker longer than you intended. Thirty seconds is a reasonable starting point; raise it only for known slow endpoints. Req also retries by default: `retry: :safe_transient` re-sends GET and HEAD requests on transport errors and on 408, 429, 500, 502, 503, and 504 responses with exponential backoff, and the `:max_retries` and `:retry_delay` options tune the schedule.

Req attaches a User-Agent header by default. Many sites inspect this header; override it with `headers: [{"user-agent", "..."}]` if you need to match a browser fingerprint, or delegate that concern entirely to OmniScrape's Web Unlocker.

fetch.exs
elixir
url = "https://books.toscrape.com/catalogue/page-1.html"

case Req.get(url, receive_timeout: 30_000) do
  {:ok, %{status: 200, body: html}} ->
    IO.puts("Fetched #{byte_size(html)} bytes")
    {:ok, html}

  {:ok, %{status: status}} ->
    {:error, "unexpected status #{status}"}

  {:error, reason} ->
    {:error, reason}
end
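
The retry schedule mentioned above can be made explicit. The following is a minimal sketch of capped exponential backoff; the module name `Scraper.Backoff`, the one-second base, and the thirty-second cap are assumptions for illustration, not values from Req or OmniScrape.

```elixir
defmodule Scraper.Backoff do
  @moduledoc "Hypothetical helper: capped exponential backoff with optional jitter."

  @base_ms 1_000
  @cap_ms 30_000

  # attempt is zero-based: 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 30s.
  def delay_ms(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@base_ms * Integer.pow(2, attempt), @cap_ms)
  end

  # Adds between 1 ms and 20% of the base delay as random jitter so that
  # parallel workers do not all retry at the same instant.
  def jittered_delay_ms(attempt) do
    base = delay_ms(attempt)
    base + :rand.uniform(div(base, 5))
  end
end
```

Assuming Req 0.5's `:retry_delay` option, which accepts a function of the zero-based retry count, this could be wired in as `Req.get(url, max_retries: 4, retry_delay: &Scraper.Backoff.delay_ms/1)`.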

3. Parse with Floki

Floki.parse_document/1 returns `{:ok, tree}` or `{:error, reason}`. The bang variant raises on malformed input — use it only when you control the HTML source. For scraped content from arbitrary sites, always pattern-match on the tuple so a single malformed page does not crash the worker.

Floki.find/2 accepts standard CSS selectors: element names, class selectors, attribute selectors, descendant combinators, and pseudo-selectors like `:first-child`. Chaining `|>` calls keeps extraction logic readable. `Floki.attribute/2` returns a list holding the named attribute's value from every node passed in (`Floki.attribute/3` takes a selector as well), so take `List.first/1` when you expect a single value; `Floki.text/2` concatenates the text nodes under the given nodes, with an optional `sep:` separator.

For deeply nested structures, `Floki.find/2` searches the full subtree. If you need to scope a search to a specific node, pass the node directly rather than the whole document — this avoids false matches from unrelated parts of the page.

parse.exs
elixir
{:ok, document} = Floki.parse_document(html)

books =
  document
  |> Floki.find("article.product_pod")
  |> Enum.map(fn card ->
    title =
      card
      |> Floki.find("h3 a")
      |> Floki.attribute("title")
      |> List.first()

    price =
      card
      |> Floki.find(".price_color")
      |> Floki.text(sep: " ")
      |> String.trim()

    rating =
      card
      |> Floki.find("p.star-rating")
      |> Floki.attribute("class")
      |> List.first()
      |> then(&String.replace(&1 || "", "star-rating ", ""))

    %{title: title, price: price, rating: rating}
  end)

IO.puts("Parsed #{length(books)} books")
IO.inspect(Enum.take(books, 3), pretty: true)
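
The maps built above hold raw strings such as a price with a currency symbol and a rating spelled as a word. A hedged sketch of normalising them follows; `Scraper.Normalize` is a hypothetical module, and it assumes prices with exactly two decimal places and no thousands separators, and ratings spelled "One" through "Five".

```elixir
defmodule Scraper.Normalize do
  @moduledoc "Hypothetical helpers for the string fields extracted above."

  @ratings %{"One" => 1, "Two" => 2, "Three" => 3, "Four" => 4, "Five" => 5}

  # Converts a price string to integer minor units to avoid float rounding.
  # Assumes two decimal places and no thousands separators.
  def price_to_minor_units(text) when is_binary(text) do
    case Regex.run(~r/(\d+)\.(\d{2})/, text) do
      [_, major, minor] ->
        {:ok, String.to_integer(major) * 100 + String.to_integer(minor)}

      _ ->
        {:error, {:unparseable_price, text}}
    end
  end

  # Maps a rating word to an integer; unknown values are reported, not guessed.
  def rating_to_int(word) do
    case Map.fetch(@ratings, word) do
      {:ok, n} -> {:ok, n}
      :error -> {:error, {:unknown_rating, word}}
    end
  end
end
```

Returning tagged tuples keeps a single odd record from crashing the `Enum.map/2` pass; the caller can log and skip it.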

4. OmniScrape via Req

For bot-protected pages, replace the Req.get call with a POST to the OmniScrape API. The request body is plain JSON; Req serialises the `json:` map automatically and sets `Content-Type: application/json`. The API key goes in the `X-API-Key` header — never hardcode it, always read from the environment.

On success, `body["data"]["content"]` holds the rendered HTML. Pass it directly to Floki.parse_document/1; the rest of your parsing pipeline is unchanged. `body["metadata"]["method_used"]` tells you whether the request was served from the fast HTTP lane or escalated to a headless browser. Use this for observability and cost attribution.

Set `receive_timeout` to at least 120 seconds. Browser rendering and challenge solving add latency beyond a normal HTTP round trip. When fetch returns challenge HTML, OmniScrape's Web Unlocker replaces the network layer transparently — see Cloudflare bypass for a deeper walkthrough.

omniscrape.exs
elixir
api_key = System.fetch_env!("OMNISCRAPE_KEY")

resp =
  Req.post!("https://api.omniscrape.io/v1/scrape",
    headers: [{"X-API-Key", api_key}],
    json: %{
      url: "https://protected-shop.com/deal/441",
      mode: "auto",
      output_format: "html",
      enable_solver: true
    },
    receive_timeout: 120_000
  )

body = resp.body

unless body["success"] do
  raise "OmniScrape request failed: #{inspect(body)}"
end

html        = get_in(body, ["data", "content"])
method_used = get_in(body, ["metadata", "method_used"])
solver_used = get_in(body, ["metadata", "solver_used"])
charged     = get_in(body, ["billing", "charged"])

IO.puts("method=#{method_used} solver=#{solver_used} charged=#{charged}")

{:ok, doc} = Floki.parse_document(html)
price = doc |> Floki.find(".product-price") |> Floki.text() |> String.trim()
IO.puts("Price: #{price}")
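
In a supervised worker the `raise` above is usually replaced with tagged tuples. A minimal sketch, assuming the response fields used in the script (`success`, `data.content`, `metadata`, `billing`); `Scraper.Response` is a hypothetical module name.

```elixir
defmodule Scraper.Response do
  @moduledoc "Hypothetical helper: turns a decoded OmniScrape body into tagged tuples."

  # Success: content is present and is a binary that can be handed to Floki.
  def unwrap(%{"success" => true, "data" => %{"content" => html}} = body)
      when is_binary(html) do
    meta = %{
      method_used: get_in(body, ["metadata", "method_used"]),
      solver_used: get_in(body, ["metadata", "solver_used"]),
      charged: get_in(body, ["billing", "charged"])
    }

    {:ok, html, meta}
  end

  # The API answered but reported a failure; keep the body for logging.
  def unwrap(%{"success" => false} = body), do: {:error, {:api_failure, body}}

  # Anything else: an HTML error page, a non-JSON body, or missing keys.
  def unwrap(other), do: {:error, {:unexpected_body, other}}
end
```

The last clause matters because a gateway error may return a body that is not a JSON map at all, in which case indexing it with `body["success"]` would raise.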

5. OTP supervision for scrape workers

Wrap scrape workers under a dedicated supervisor so a timeout against one slow domain restarts only that child, not the process managing fifty other domains. The simplest topology: a `Supervisor` with `strategy: :one_for_one` that owns a `Task.Supervisor` for ad-hoc tasks and a `Registry` for named workers.

For long-running domain scrapers, use a `GenServer` instead of a bare Task. GenServer state tracks retry count, last-fetched timestamp, and backoff interval. The supervisor restarts the GenServer on crash; the GenServer itself decides whether to retry or escalate to a dead-letter queue.

Avoid `strategy: :one_for_all` in scrape trees — a single bad URL should never restart the workers handling healthy domains. Keep the supervision tree shallow: one top-level supervisor, one task supervisor per logical group (e.g., per-domain or per-job-type).

supervisor.ex
elixir
defmodule Scraper.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    children = [
      {Task.Supervisor, name: Scraper.TaskSupervisor},
      {Registry, keys: :unique, name: Scraper.Registry}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

# Dispatching a supervised task from a worker:
task =
  Task.Supervisor.async_nolink(Scraper.TaskSupervisor, fn ->
    Scraper.OmniScrape.fetch_and_parse(url)
  end)

case Task.yield(task, 130_000) || Task.shutdown(task) do
  {:ok, result}  -> handle_result(result)
  {:exit, reason} -> handle_failure(url, reason)
  nil             -> handle_timeout(url)
end
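
To see the isolation in practice, the dispatch pattern above can be run stand-alone with stub functions in place of real scrapes. This is a sketch only; `Demo.TaskSupervisor` and the one-second yield are assumptions chosen to keep the demonstration short.

```elixir
# Stand-alone demonstration; Demo.TaskSupervisor is a hypothetical name.
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

# Runs fun as an unlinked, supervised task and maps every outcome to a tuple.
run = fn fun ->
  task = Task.Supervisor.async_nolink(Demo.TaskSupervisor, fun)

  case Task.yield(task, 1_000) || Task.shutdown(task) do
    {:ok, result} -> {:ok, result}
    {:exit, reason} -> {:error, {:crashed, reason}}
    nil -> {:error, :timeout}
  end
end

ok = run.(fn -> :parsed end)
crashed = run.(fn -> raise "selector not found" end)
slow = run.(fn -> Process.sleep(5_000) end)

# The calling process is still alive after a crash and a timeout.
IO.inspect({ok, crashed, slow}, label: "outcomes")
```

The crashed task is logged and reported as `{:exit, reason}`; it is not restarted, so any retry is an explicit decision in `handle_failure/2`.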

6. Concurrent URLs with async_stream

Task.async_stream/3 is the idiomatic way to fan out over a list of URLs while capping parallelism. Set `max_concurrency` to match your OmniScrape plan's concurrency limit — starting at 5 to 10 is safe while you observe 429 rates in telemetry. Set `timeout` slightly above your API `receive_timeout` so the stream kills a stalled task before it blocks the pipeline indefinitely.

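The `on_timeout: :kill_task` option sends an exit signal to the task process and returns `{:exit, :timeout}` in the stream. Always handle both `:ok` and `:exit` tuples; letting an unmatched clause crash the stream defeats the purpose of bounded concurrency. Note that tasks started by Task.async_stream/3 are linked (through an intermediate process) to the caller, so an exception raised inside a task brings the caller down as well unless it traps exits; Task.Supervisor.async_stream_nolink/4 reports such crashes as `{:exit, reason}` instead.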

For very large URL lists, stream lazily rather than collecting into a list first. Pipe the async_stream result directly into `Stream.each` or `Flow` to keep memory bounded.

stream.exs
elixir
urls = [
  "https://example.com/product/1",
  "https://example.com/product/2",
  "https://example.com/product/3"
]

results =
  urls
  |> Task.async_stream(
       &Scraper.OmniScrape.scrape_css/1,
       max_concurrency: 5,
       timeout: 130_000,
       on_timeout: :kill_task
     )
  |> Enum.reduce({[], []}, fn
    {:ok, data},     {ok, err} -> {[data | ok], err}
    {:exit, reason}, {ok, err} -> {ok, [{:error, reason} | err]}
  end)

{successes, failures} = results
IO.puts("ok=#{length(successes)} failed=#{length(failures)}")
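
When the scrape function itself may raise, an unlinked variant keeps the caller alive. A minimal sketch using `Task.Supervisor.async_stream_nolink/4` with a stub in place of a real scrape; `Demo.StreamSupervisor`, the stub's behaviour, and the short timeout are assumptions for the demonstration.

```elixir
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.StreamSupervisor)

# Stub standing in for a real scrape function (hypothetical behaviour).
scrape = fn
  "https://example.com/crash" -> raise "boom"
  "https://example.com/slow" -> Process.sleep(2_000)
  url -> {:ok, url}
end

urls = [
  "https://example.com/a",
  "https://example.com/crash",
  "https://example.com/slow"
]

results =
  Demo.StreamSupervisor
  |> Task.Supervisor.async_stream_nolink(urls, scrape,
    max_concurrency: 2,
    timeout: 500,
    on_timeout: :kill_task
  )
  |> Enum.to_list()

# Results come back in input order by default, so they can be paired with
# the URL that produced them.
outcomes =
  urls
  |> Enum.zip(results)
  |> Enum.map(fn
    {_url, {:ok, result}} -> result
    {url, {:exit, reason}} -> {:error, url, reason}
  end)

IO.inspect(outcomes, label: "outcomes")
```

Pairing each result with its URL makes failures actionable: a timeout and a crash both arrive as `{:exit, reason}`, and the URL tells you which job to requeue.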

7. Broadway for ingest pipelines

When scrape results feed a message queue, database, or data warehouse, Broadway adds built-in backpressure, batching, and acknowledgement semantics on top of GenStage. Define a producer that emits URLs (from SQS, RabbitMQ, or a custom GenStage source), fetch and parse in `handle_message/3`, and write to Postgres in `handle_batch/4`.

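Broadway handles the concurrency model for you: `concurrency` on the processor stage controls how many OmniScrape requests run in parallel; `batch_size` and `batch_timeout` on the batcher control how many rows are inserted per transaction. Failed messages are handed to the producer's acknowledger; whether they are requeued, rejected, or left to reappear after a visibility timeout depends on the producer and its configuration, so confirm that behaviour before relying on it in place of manual retry logic.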

Use `output_format: "css_extractor"` with `css_selectors` when fields are fixed — Broadway's batcher receives structured maps directly without a Floki parsing step, reducing per-message CPU cost at high throughput.

pipeline.ex
elixir
defmodule Scraper.Pipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module: {Scraper.UrlProducer, []},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 5]
      ],
      batchers: [
        db: [concurrency: 2, batch_size: 50, batch_timeout: 2_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: url} = msg, _ctx) do
    case Scraper.OmniScrape.fetch_structured(url) do
      {:ok, fields} -> Message.put_data(msg, fields) |> Message.put_batcher(:db)
      {:error, _}   -> Message.failed(msg, "fetch_error")
    end
  end

  @impl true
  def handle_batch(:db, messages, _batch_info, _ctx) do
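    rows = Enum.map(messages, & &1.data)

    # PostgreSQL requires a conflict target for upserts. :url is an assumed
    # column with a unique index; adjust it to match your schema.
    Scraper.Repo.insert_all("products", rows,
      on_conflict: :replace_all,
      conflict_target: :url
    )
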
    messages
  end
end

8. js_rendering for rendered DOM

Floki operates on raw HTML — it does not evaluate JavaScript, execute XHR calls, or wait for React hydration. If `Req.get` returns a skeleton document with empty product containers, the page requires JavaScript execution. Use `mode: "js_rendering"` in the OmniScrape request to run a headless browser server-side.

Pair `js_rendering` with `js_wait_selector` to block until a specific DOM element appears before the snapshot is taken. This avoids arbitrary `js_wait_timeout` sleeps and produces a stable HTML response regardless of network jitter on the rendering host. Set `js_wait_timeout` as a ceiling — the request resolves as soon as the selector matches, not after the full timeout.

See scraping JavaScript-rendered pages for a full breakdown of when to use `js_rendering` versus `auto` mode escalation.

js_rendering.exs
elixir
api_key = System.fetch_env!("OMNISCRAPE_KEY")

resp =
  Req.post!("https://api.omniscrape.io/v1/scrape",
    headers: [{"X-API-Key", api_key}],
    json: %{
      url: "https://spa-store.com/catalog",
      mode: "js_rendering",
      output_format: "html",
      js_wait_selector: ".product-card",
      js_wait_timeout: 12_000
    },
    receive_timeout: 120_000
  )

body = resp.body

if body["success"] do
  html = get_in(body, ["data", "content"])
  {:ok, doc} = Floki.parse_document(html)

  products =
    doc
    |> Floki.find(".product-card")
    |> Enum.map(fn card ->
      %{
        name:  card |> Floki.find(".product-name")  |> Floki.text() |> String.trim(),
        price: card |> Floki.find(".product-price") |> Floki.text() |> String.trim()
      }
    end)

  IO.inspect(products, label: "products")
else
  IO.puts("Request failed: #{inspect(body)}")
end

9. Let it crash — with budget caps

OTP's "let it crash" philosophy means you trust the supervisor to restart failed workers rather than defensive-coding every possible failure. That is sound for transient network errors. It is not a license to ignore billing signals or spin up infinite retries against a permanently blocked domain.

Map each HTTP status and API error to a concrete action. Treat 401 as a boot-time configuration error and fail fast before any work starts. Treat 402 as a billing ceiling: pause the Broadway producer and alert on-call rather than burning through a credit overage. Treat 429 as a rate signal and implement exponential backoff in GenServer state, not a bare `Process.sleep`. Treat 502 as a safe retry, but put the delay in the worker: a supervisor restarts children immediately and gives up once `max_restarts` is exceeded within `max_seconds` (3 restarts in 5 seconds by default), so it is not a backoff mechanism.

  • 401 — missing or invalid API key: raise at application boot, do not start the supervisor tree
  • 402 — billing limit reached: emit a telemetry event, pause Broadway producer, alert on-call
  • 429 — rate limited: exponential backoff in GenServer retry state, cap at 5 attempts
  • 502 / 503 — upstream error: safe to retry via Task.Supervisor restart or Broadway nack
  • success: false with known error code — log structured error, route to dead-letter queue, no infinite restart
  • Floki.parse_document error — malformed HTML: log raw response for inspection, skip record, do not crash worker
  • js_wait_selector timeout — selector never appeared: check if page structure changed, alert if error rate exceeds threshold
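
The status mapping in the list above can be sketched as a pure function that a worker consults before deciding what to do next. `Scraper.ErrorBudget` and the returned atoms are hypothetical, and applying the same five-attempt cap to 502 and 503 is an assumption; the list only states that cap for 429.

```elixir
defmodule Scraper.ErrorBudget do
  @moduledoc "Hypothetical mapping from HTTP status to a worker action."

  @max_attempts 5

  # attempt is the number of tries already made for this URL.
  def classify(200, _attempt), do: :ok
  def classify(401, _attempt), do: {:halt, :invalid_api_key}
  def classify(402, _attempt), do: {:pause, :billing_limit}

  def classify(429, attempt) when attempt < @max_attempts,
    do: {:retry_after, backoff_ms(attempt)}

  def classify(429, _attempt), do: {:dead_letter, :rate_limited}

  # Assumption: upstream errors share the 429 attempt cap.
  def classify(status, attempt) when status in [502, 503] and attempt < @max_attempts,
    do: {:retry_after, backoff_ms(attempt)}

  # Everything else, including exhausted retries, goes to the dead-letter path.
  def classify(status, _attempt), do: {:dead_letter, {:status, status}}

  # Capped exponential backoff: 1s, 2s, 4s, ... up to 30s.
  defp backoff_ms(attempt), do: min(1_000 * Integer.pow(2, attempt), 30_000)
end
```

A GenServer worker could keep the attempt count in its state and schedule the retry with `Process.send_after/3` using the returned delay, which keeps the process responsive while it waits.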

Frequently asked questions

Should I use Req or HTTPoison in a new Elixir project?

Req. It is actively maintained, ships with composable middleware for retries, compression, and JSON encoding out of the box, and has a cleaner API than HTTPoison. HTTPoison wraps hackney and carries legacy design decisions, so avoid it in greenfield projects. Finch, which Req itself uses and which adds connection pooling on top of Mint, is a good lower-level alternative if you want pooling without the Req abstraction; Mint on its own is a process-less client with no pool.

When should I use Floki versus OmniScrape's css_extractor output format?

Use css_extractor when the fields you need are fixed and well-defined — the API does the extraction server-side and returns a structured map, so your Elixir code never touches raw HTML. Use Floki when you need to traverse complex or variable DOM structures, archive the full HTML for later reprocessing, or extract fields whose selectors vary by page type. For Broadway pipelines at high throughput, css_extractor reduces per-message CPU cost significantly.

Floki or SweetXml for parsing?

Floki for HTML scraped from websites. SweetXml is the right tool when you have genuine XML — RSS feeds, sitemaps, API responses with XML content types. Feeding tag-soup HTML into SweetXml produces unreliable results because browsers are far more lenient than XML parsers about malformed markup.

How do I choose max_concurrency for Task.async_stream against OmniScrape?

Start at 5 and watch two metrics: 429 response rates in your telemetry and the `billing.charged` field per request. If you see no 429s and billing looks expected, increment by 5 and observe again. The right ceiling depends on your OmniScrape plan's concurrency allowance and the target site's tolerance — there is no universal number. Instrument with :telemetry.execute/3 so you can graph this without changing code.

Is Phoenix LiveView useful for scraping infrastructure?

Yes, as an internal operations dashboard. LiveView's PubSub integration makes it straightforward to stream job progress, error rates, and billing spend in real time without polling. It is not appropriate as a customer-facing interface for scraped data — those pages should read from a cache or database, not trigger live scrape requests.

How do I handle pagination across many pages without blocking the supervisor?

Model pagination as a GenServer that holds a cursor (page number or next-page URL) in state. On each `handle_info(:next_page, state)` call, fetch one page, parse results, persist them, then send `self()` a `:next_page` message with an updated cursor. The GenServer processes pages sequentially per domain but multiple GenServers run concurrently under the supervisor — one per domain or job. This avoids spawning an unbounded number of tasks and keeps memory usage predictable.
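
A minimal sketch of that cursor-holding GenServer follows. `Scraper.Paginator` is a hypothetical module, the page fetcher is injected as a function so the example runs without network access, and results are accumulated in state only for brevity where a real worker would persist each page.

```elixir
defmodule Scraper.Paginator do
  @moduledoc "Hypothetical sketch: one GenServer per domain walking a cursor."
  use GenServer

  # opts[:fetch_page] is a function:
  #   cursor -> {:ok, items, next_cursor_or_nil} | {:error, reason}
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def collected(pid), do: GenServer.call(pid, :collected)

  @impl true
  def init(opts) do
    state = %{
      fetch_page: Keyword.fetch!(opts, :fetch_page),
      cursor: Keyword.get(opts, :cursor, 1),
      items: [],
      status: :running
    }

    # Start paging after init returns so the supervisor is not blocked.
    send(self(), :next_page)
    {:ok, state}
  end

  @impl true
  def handle_info(:next_page, %{cursor: nil} = state) do
    {:noreply, %{state | status: :done}}
  end

  def handle_info(:next_page, state) do
    case state.fetch_page.(state.cursor) do
      {:ok, items, next} ->
        # One page per message keeps the process able to answer calls
        # between pages.
        send(self(), :next_page)
        {:noreply, %{state | items: state.items ++ items, cursor: next}}

      {:error, reason} ->
        {:noreply, %{state | status: {:failed, reason}}}
    end
  end

  @impl true
  def handle_call(:collected, _from, state) do
    {:reply, {state.status, state.items}, state}
  end
end

# Usage with an in-memory stand-in for two pages of results.
pages = %{1 => {["a", "b"], 2}, 2 => {["c"], nil}}

fetch = fn cursor ->
  {items, next} = Map.fetch!(pages, cursor)
  {:ok, items, next}
end

{:ok, pid} = Scraper.Paginator.start_link(fetch_page: fetch)
```

Because paging is asynchronous, `Scraper.Paginator.collected(pid)` may report `:running` until the final `:next_page` message has been handled.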

What is the difference between mode auto and js_rendering in OmniScrape?

mode: "auto" tries the fast HTTP lane first and escalates to a headless browser automatically if the response looks like a bot challenge or empty skeleton. It is the right default for most targets. mode: "js_rendering" forces headless browser execution unconditionally — use it when you already know the page requires JavaScript and you want to skip the fast-lane attempt to avoid the escalation latency. For bot-protected pages, combine mode: "auto" with enable_solver: true.

Related guides

  • Web Scraping with Python
  • How to Bypass Cloudflare When Web Scraping
  • Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs
  • Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

© 2026 OmniScrape. All rights reserved.
