1. Mix dependencies
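Three libraries cover the full scraping stack. Req is the modern HTTP client: actively maintained, built from composable request and response steps, with retries and JSON handling built in. Floki is an HTML parser with CSS-selector queries; its default backend is the Erlang mochiweb parser, and html5ever (a Rust NIF) or fast_html can be swapped in when you need stricter HTML5 handling. It accepts tag soup and returns a traversable tree. Jason is the standard JSON codec; Phoenix projects already have it, and plain Mix apps should declare it explicitly.
Run `mix deps.get` after editing mix.exs. Choose version requirements deliberately so CI does not silently pull a breaking change: `~> 0.5` accepts any release from 0.5.0 up to, but not including, 1.0.0, while `~> 0.5.0` stays within the 0.5.x series. Commit mix.lock and treat it as the source of truth for exact versions.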
# mix.exs
defp deps do
  [
    {:req, "~> 0.5"},
    {:floki, "~> 0.36"},
    {:jason, "~> 1.4"}
  ]
end
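To see what a given requirement actually admits, the standard library's `Version.match?/2` can be run against a few release numbers. This is a minimal sketch; the module name `VersionCheck` and the release list are made up for illustration and do not correspond to real Req releases.

```elixir
defmodule VersionCheck do
  # Returns the subset of `releases` that satisfy the Mix-style `requirement`.
  def accepted(releases, requirement) do
    Enum.filter(releases, &Version.match?(&1, requirement))
  end
end

# Hypothetical release numbers, used only to compare the two requirement styles.
releases = ["0.5.0", "0.5.9", "0.6.0", "1.0.0"]

for requirement <- ["~> 0.5", "~> 0.5.0"] do
  accepted = VersionCheck.accepted(releases, requirement)
  IO.puts("#{requirement} accepts #{Enum.join(accepted, ", ")}")
end
```

The two-segment form lets a 0.6.0 release through, which matters for pre-1.0 libraries where a minor bump may carry breaking changes.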
2. Fetch with Req
Req.get!/2 raises on network failure, which is acceptable in one-off scripts and dangerous in supervised workers. Production code should use Req.request/1 or Req.get/2 and pattern-match on `{:ok, resp}` versus `{:error, reason}`. The distinction matters: a crash inside a Task.Supervisor child is contained to that task (such tasks are temporary by default and are not restarted), whereas an unhandled exception during a GenServer call can take down the caller.
Set `receive_timeout` explicitly. Req's default of 15 seconds suits fast servers but can be too short for slow endpoints and rendering proxies. Thirty seconds is a reasonable starting point; raise it only for known slow endpoints. Req also retries automatically via the `retry:` option, with exponential backoff between attempts by default; `retry_delay:` and `max_retries:` tune that behaviour for transient 5xx responses.
Req attaches a User-Agent header by default. Many sites inspect this header; override it with `headers: [{"user-agent", "..."}]` if you need to match a browser fingerprint, or delegate that concern entirely to OmniScrape's Web Unlocker.
url = "https://books.toscrape.com/catalogue/page-1.html"

case Req.get(url, receive_timeout: 30_000) do
  {:ok, %{status: 200, body: html}} ->
    IO.puts("Fetched #{byte_size(html)} bytes")
    {:ok, html}

  {:ok, %{status: status}} ->
    {:error, "unexpected status #{status}"}

  {:error, reason} ->
    {:error, reason}
end
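If the default retry schedule does not fit, a capped exponential delay can be computed in a small pure function. The sketch below assumes a 1-second base and a 30-second cap; the module name `Backoff` and both constants are illustrative choices, not values taken from Req.

```elixir
defmodule Backoff do
  @base_ms 1_000
  @cap_ms 30_000

  # retry_count starts at 0 for the first retry; the delay doubles each time
  # and is clamped to @cap_ms.
  def delay_ms(retry_count) when is_integer(retry_count) and retry_count >= 0 do
    min(@base_ms * Integer.pow(2, retry_count), @cap_ms)
  end
end

# Print the schedule for the first six retries.
for n <- 0..5, do: IO.puts("retry #{n}: wait #{Backoff.delay_ms(n)} ms")

# Assumed wiring, left as a comment so the sketch runs without Req installed:
#   Req.get(url, retry: :transient, retry_delay: &Backoff.delay_ms/1, max_retries: 4)
```

Keeping the schedule in its own function makes it easy to unit-test and to reuse from a GenServer that manages its own retries.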
3. Parse with Floki
Floki.parse_document/1 returns `{:ok, tree}` or `{:error, reason}`. The bang variant raises on malformed input — use it only when you control the HTML source. For scraped content from arbitrary sites, always pattern-match on the tuple so a single malformed page does not crash the worker.
Floki.find/2 accepts standard CSS selectors: element names, class selectors, attribute selectors, descendant combinators, and pseudo-classes like `:first-child`. Chaining `|>` calls keeps extraction logic readable. `Floki.attribute/2` returns a list holding the named attribute's value from every node it is given (`Floki.attribute/3` takes a selector as well), so use `List.first/1` when you expect a single value. `Floki.text/2` concatenates all text nodes under the tree it is given, with an optional `sep:` separator.
For deeply nested structures, `Floki.find/2` searches the full subtree. If you need to scope a search to a specific node, pass the node directly rather than the whole document — this avoids false matches from unrelated parts of the page.
{:ok, document} = Floki.parse_document(html)

books =
  document
  |> Floki.find("article.product_pod")
  |> Enum.map(fn card ->
    title =
      card
      |> Floki.find("h3 a")
      |> Floki.attribute("title")
      |> List.first()

    price =
      card
      |> Floki.find(".price_color")
      |> Floki.text(sep: " ")
      |> String.trim()

    rating =
      card
      |> Floki.find("p.star-rating")
      |> Floki.attribute("class")
      |> List.first()
      |> then(&String.replace(&1 || "", "star-rating ", ""))

    %{title: title, price: price, rating: rating}
  end)

IO.puts("Parsed #{length(books)} books")
IO.inspect(Enum.take(books, 3), pretty: true)
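The extracted strings usually need one more normalisation step before storage. The following sketch assumes the rating arrives as a class string such as "star-rating Three" and the price as text such as "£51.77"; the module name `BookFields` and the choice to store prices as integer minor units are illustrative assumptions.

```elixir
defmodule BookFields do
  @ratings %{"One" => 1, "Two" => 2, "Three" => 3, "Four" => 4, "Five" => 5}

  # Finds the first class token that names a rating; nil when none matches.
  def rating(nil), do: nil

  def rating(class) when is_binary(class) do
    class
    |> String.split()
    |> Enum.find_value(&Map.get(@ratings, &1))
  end

  # Converts price text to integer minor units (pence or cents); nil when no
  # amount with two decimal places is present.
  def price_minor_units(text) when is_binary(text) do
    case Regex.run(~r/(\d+)\.(\d{2})/, text) do
      [_, whole, frac] -> String.to_integer(whole) * 100 + String.to_integer(frac)
      _ -> nil
    end
  end
end

IO.inspect(BookFields.rating("star-rating Three"), label: "rating")
IO.inspect(BookFields.price_minor_units("£51.77"), label: "price")
```

Integers avoid float rounding when prices are later summed or compared.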
4. OmniScrape via Req
For bot-protected pages, replace the Req.get call with a POST to the OmniScrape API. The request body is plain JSON; Req serialises the `json:` map automatically and sets `Content-Type: application/json`. The API key goes in the `X-API-Key` header — never hardcode it, always read from the environment.
On success, `body["data"]["content"]` holds the rendered HTML. Pass it directly to Floki.parse_document/1 — the rest of your parsing pipeline is unchanged. `metadata["method_used"]` tells you whether the request was served from the fast HTTP lane or escalated to a headless browser. Use this for observability and cost attribution.
Set `receive_timeout` to at least 120 seconds. Browser rendering and challenge solving add latency beyond a normal HTTP round trip. When a plain fetch returns challenge HTML, OmniScrape's Web Unlocker replaces the network layer transparently; see the Cloudflare bypass guide for a deeper walkthrough.
api_key = System.fetch_env!("OMNISCRAPE_KEY")

resp =
  Req.post!("https://api.omniscrape.io/v1/scrape",
    headers: [{"X-API-Key", api_key}],
    json: %{
      url: "https://protected-shop.com/deal/441",
      mode: "auto",
      output_format: "html",
      enable_solver: true
    },
    receive_timeout: 120_000
  )

body = resp.body

unless body["success"] do
  raise "OmniScrape request failed: #{inspect(body)}"
end

html = get_in(body, ["data", "content"])
method_used = get_in(body, ["metadata", "method_used"])
solver_used = get_in(body, ["metadata", "solver_used"])
charged = get_in(body, ["billing", "charged"])

IO.puts("method=#{method_used} solver=#{solver_used} charged=#{charged}")

{:ok, doc} = Floki.parse_document(html)
price = doc |> Floki.find(".product-price") |> Floki.text() |> String.trim()
IO.puts("Price: #{price}")
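In a supervised worker, raising on `success: false` is often less useful than returning a tagged tuple the caller can route. Here is a minimal sketch of that unpacking step as a pure function over the decoded body. It assumes only the response keys used above (`success`, `data.content`, `metadata.method_used`, `billing.charged`); the module name `OmniResponse` and the error tuple shape are hypothetical.

```elixir
defmodule OmniResponse do
  # Success path: the body reports success and carries HTML content.
  def unpack(%{"success" => true, "data" => %{"content" => html}} = body)
      when is_binary(html) do
    {:ok,
     %{
       html: html,
       method_used: get_in(body, ["metadata", "method_used"]),
       charged: get_in(body, ["billing", "charged"])
     }}
  end

  # Anything else is surfaced as an error for the caller to log or dead-letter.
  def unpack(body), do: {:error, {:api_error, body}}
end

# Hand-written sample body for illustration, not a captured API response.
sample = %{
  "success" => true,
  "data" => %{"content" => "<html></html>"},
  "metadata" => %{"method_used" => "http"},
  "billing" => %{"charged" => 1}
}

IO.inspect(OmniResponse.unpack(sample))
IO.inspect(OmniResponse.unpack(%{"success" => false, "error" => "blocked"}))
```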
5. OTP supervision for scrape workers
Wrap scrape workers under a dedicated supervisor so a timeout against one slow domain takes down only that child, not the process managing fifty other domains. The simplest topology: a `Supervisor` with `strategy: :one_for_one` that owns a `Task.Supervisor` for ad-hoc tasks and a `Registry` for named workers. Tasks started under a `Task.Supervisor` are temporary by default, so a failed task is reported to the caller rather than restarted.
For long-running domain scrapers, use a `GenServer` instead of a bare Task. GenServer state tracks retry count, last-fetched timestamp, and backoff interval. The supervisor restarts the GenServer on crash; the GenServer itself decides whether to retry or escalate to a dead-letter queue.
Avoid `strategy: :one_for_all` in scrape trees — a single bad URL should never restart the workers handling healthy domains. Keep the supervision tree shallow: one top-level supervisor, one task supervisor per logical group (e.g., per-domain or per-job-type).
defmodule Scraper.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      {Task.Supervisor, name: Scraper.TaskSupervisor},
      {Registry, keys: :unique, name: Scraper.Registry}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

# Dispatching a supervised task from a worker:
task =
  Task.Supervisor.async_nolink(Scraper.TaskSupervisor, fn ->
    Scraper.OmniScrape.fetch_and_parse(url)
  end)

# Wait up to 130 s; if the task has not replied by then, shut it down.
case Task.yield(task, 130_000) || Task.shutdown(task) do
  {:ok, result} -> handle_result(result)
  {:exit, reason} -> handle_failure(url, reason)
  nil -> handle_timeout(url)
end
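The GenServer approach described above, with retry count, last-fetched timestamp, and backoff interval held in state, might look like the following sketch. The module name `DomainWorker`, its options, and the status atoms are all hypothetical, and the fetch function is injected so the example runs without any HTTP client.

```elixir
defmodule DomainWorker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(opts) do
    state = %{
      fetch: Keyword.fetch!(opts, :fetch),
      base_backoff_ms: Keyword.get(opts, :base_backoff_ms, 1_000),
      max_attempts: Keyword.get(opts, :max_attempts, 5),
      attempts: 0,
      last_fetched_at: nil,
      status: :running
    }

    # Kick off the first fetch as soon as the process is up.
    send(self(), :fetch)
    {:ok, state}
  end

  @impl true
  def handle_info(:fetch, state) do
    case state.fetch.() do
      {:ok, _result} ->
        now = System.monotonic_time(:millisecond)
        {:noreply, %{state | attempts: 0, last_fetched_at: now, status: :idle}}

      {:error, _reason} ->
        attempts = state.attempts + 1

        if attempts >= state.max_attempts do
          # Give up; a real worker would hand the URL to a dead-letter queue here.
          {:noreply, %{state | attempts: attempts, status: :dead_letter}}
        else
          # Double the delay on each consecutive failure.
          delay = state.base_backoff_ms * Integer.pow(2, attempts - 1)
          Process.send_after(self(), :fetch, delay)
          {:noreply, %{state | attempts: attempts, status: :backing_off}}
        end
    end
  end

  @impl true
  def handle_call(:status, _from, state) do
    {:reply, Map.take(state, [:status, :attempts, :last_fetched_at]), state}
  end
end
```

Scheduling the retry with `Process.send_after/3` keeps the process responsive to `status/1` calls during the backoff window, which a blocking sleep would not.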
6. Concurrent URLs with async_stream
Task.async_stream/3 is the idiomatic way to fan out over a list of URLs while capping parallelism. Set `max_concurrency` to match your OmniScrape plan's concurrency limit — starting at 5 to 10 is safe while you observe 429 rates in telemetry. Set `timeout` slightly above your API `receive_timeout` so the stream kills a stalled task before it blocks the pipeline indefinitely.
The `on_timeout: :kill_task` option sends an exit signal to the task process and returns `{:exit, :timeout}` in the stream. Always handle both `:ok` and `:exit` tuples — letting an unmatched clause crash the stream defeats the purpose of bounded concurrency.
For very large URL lists, stream lazily rather than collecting into a list first. Pipe the async_stream result directly into `Stream.each` or `Flow` to keep memory bounded.
urls = [
  "https://example.com/product/1",
  "https://example.com/product/2",
  "https://example.com/product/3"
]

results =
  urls
  |> Task.async_stream(
    &Scraper.OmniScrape.scrape_css/1,
    max_concurrency: 5,
    timeout: 130_000,
    on_timeout: :kill_task
  )
  |> Enum.reduce({[], []}, fn
    {:ok, data}, {ok, err} -> {[data | ok], err}
    {:exit, reason}, {ok, err} -> {ok, [{:error, reason} | err]}
  end)

{successes, failures} = results
IO.puts("ok=#{length(successes)} failed=#{length(failures)}")
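One detail worth making explicit: the `{:ok, value}` wrapper from the stream only says the task finished, so a scraper that itself returns `{:error, reason}` arrives as `{:ok, {:error, reason}}`. The sketch below separates the three outcomes and keeps only running counts, so nothing accumulates for large inputs. `FanOut` and the stub scraper are hypothetical stand-ins for your own module.

```elixir
defmodule FanOut do
  # scrape_fun is expected to return {:ok, data} or {:error, reason}.
  def summarize(urls, scrape_fun, opts \\ []) do
    urls
    |> Task.async_stream(scrape_fun,
      max_concurrency: Keyword.get(opts, :max_concurrency, 5),
      timeout: Keyword.get(opts, :timeout, 130_000),
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{ok: 0, error: 0, timeout: 0}, fn
      # Task finished and the scrape succeeded.
      {:ok, {:ok, _data}}, acc -> Map.update!(acc, :ok, &(&1 + 1))
      # Task finished but the scrape reported a failure.
      {:ok, {:error, _reason}}, acc -> Map.update!(acc, :error, &(&1 + 1))
      # Task exceeded :timeout and was killed.
      {:exit, :timeout}, acc -> Map.update!(acc, :timeout, &(&1 + 1))
      # Any other exit (only seen when the caller traps exits).
      {:exit, _other}, acc -> Map.update!(acc, :error, &(&1 + 1))
    end)
  end
end

# Stub scraper: no network calls; one URL is slow, one is rejected.
stub = fn
  "slow" ->
    Process.sleep(1_000)
    {:ok, %{}}

  "bad" ->
    {:error, :blocked}

  _url ->
    {:ok, %{}}
end

IO.inspect(FanOut.summarize(["a", "bad", "slow", "b"], stub, timeout: 50))
```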
7. Broadway for ingest pipelines
When scrape results feed a message queue, database, or data warehouse, Broadway adds built-in backpressure, batching, and acknowledgement semantics on top of GenStage. Define a producer that emits URLs (from SQS, RabbitMQ, or a custom GenStage source), fetch and parse in `handle_message/3`, and write to Postgres in `handle_batch/4`.
Broadway handles the concurrency model for you: `concurrency` on the processor stage controls how many OmniScrape requests run in parallel; `batch_size` and `batch_timeout` on the batcher control how many rows are inserted per transaction. Failed messages are handed back to the producer's acknowledger, so with a queue such as SQS or RabbitMQ they can be redelivered according to the queue's configuration without hand-written retry logic.
Use `output_format: "css_extractor"` with `css_selectors` when fields are fixed — Broadway's batcher receives structured maps directly without a Floki parsing step, reducing per-message CPU cost at high throughput.
defmodule Scraper.Pipeline do
  use Broadway

  alias Broadway.Message

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        module: {Scraper.UrlProducer, []},
        concurrency: 1
      ],
      processors: [
        default: [concurrency: 5]
      ],
      batchers: [
        db: [concurrency: 2, batch_size: 50, batch_timeout: 2_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, %Message{data: url} = msg, _ctx) do
    case Scraper.OmniScrape.fetch_structured(url) do
      {:ok, fields} -> msg |> Message.put_data(fields) |> Message.put_batcher(:db)
      {:error, _} -> Message.failed(msg, "fetch_error")
    end
  end

  @impl true
  def handle_batch(:db, messages, _batch_info, _ctx) do
    rows = Enum.map(messages, & &1.data)

    # Assumptions: Scraper.Product is an Ecto schema for the "products" table
    # (:replace_all needs a schema to know which fields to replace), and :url has
    # a unique index (PostgreSQL upserts require an explicit :conflict_target).
    Scraper.Repo.insert_all(Scraper.Product, rows,
      on_conflict: :replace_all,
      conflict_target: :url
    )

    messages
  end
end
8. js_rendering for rendered DOM
Floki operates on raw HTML — it does not evaluate JavaScript, execute XHR calls, or wait for React hydration. If `Req.get` returns a skeleton document with empty product containers, the page requires JavaScript execution. Use `mode: "js_rendering"` in the OmniScrape request to run a headless browser server-side.
Pair `js_rendering` with `js_wait_selector` to block until a specific DOM element appears before the snapshot is taken. This avoids arbitrary `js_wait_timeout` sleeps and produces a stable HTML response regardless of network jitter on the rendering host. Set `js_wait_timeout` as a ceiling — the request resolves as soon as the selector matches, not after the full timeout.
See the guide on scraping JavaScript-rendered pages for a full breakdown of when to use `js_rendering` versus `auto` mode escalation.
api_key = System.fetch_env!("OMNISCRAPE_KEY")

resp =
  Req.post!("https://api.omniscrape.io/v1/scrape",
    headers: [{"X-API-Key", api_key}],
    json: %{
      url: "https://spa-store.com/catalog",
      mode: "js_rendering",
      output_format: "html",
      js_wait_selector: ".product-card",
      js_wait_timeout: 12_000
    },
    receive_timeout: 120_000
  )

body = resp.body

if body["success"] do
  html = get_in(body, ["data", "content"])
  {:ok, doc} = Floki.parse_document(html)

  products =
    doc
    |> Floki.find(".product-card")
    |> Enum.map(fn card ->
      %{
        name: card |> Floki.find(".product-name") |> Floki.text() |> String.trim(),
        price: card |> Floki.find(".product-price") |> Floki.text() |> String.trim()
      }
    end)

  IO.inspect(products, label: "products")
else
  IO.puts("Request failed: #{inspect(body)}")
end
9. Let it crash, with budget caps
OTP's "let it crash" philosophy means you trust the supervisor to restart failed workers rather than defensive-coding every possible failure. That is sound for transient network errors. It is not a license to ignore billing signals or spin up infinite retries against a permanently blocked domain.
Map each HTTP status and API error to a concrete action. Treat 401 as a boot-time configuration error: fail fast before any work starts. Treat 402 as a billing ceiling: pause the Broadway producer and alert on-call rather than burning through a credit overage. Treat 429 as a rate signal: implement exponential backoff in GenServer state, not a bare `Process.sleep`. Treat 502 as a safe retry, but schedule it explicitly with a delay: a supervisor restarts its children immediately, and Task.Supervisor tasks are not restarted at all by default.
- 401 (missing or invalid API key): raise at application boot, do not start the supervisor tree
- 402 (billing limit reached): emit a telemetry event, pause Broadway producer, alert on-call
- 429 (rate limited): exponential backoff in GenServer retry state, cap at 5 attempts
- 502 / 503 (upstream error): safe to retry by dispatching a fresh task or failing the Broadway message
- success: false with known error code: log structured error, route to dead-letter queue, no infinite restart
- Floki.parse_document error (malformed HTML): log raw response for inspection, skip record, do not crash worker
- js_wait_selector timeout (selector never appeared): check if page structure changed, alert if error rate exceeds threshold
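The status-to-action mapping above can be sketched as a pure function, which keeps the policy testable and out of the worker's control flow. The module name `ErrorPolicy` and the action atoms are hypothetical; only the status codes and the five-attempt cap come from the list.

```elixir
defmodule ErrorPolicy do
  @max_rate_limit_attempts 5

  # `attempts` is the number of tries already made for this URL.
  def decide(status, attempts \\ 0)

  def decide(status, _attempts) when status in 200..299, do: :ok
  def decide(401, _attempts), do: :halt_boot
  def decide(402, _attempts), do: :pause_producer

  # Rate limited: back off until the attempt cap is reached, then give up.
  def decide(429, attempts) when attempts < @max_rate_limit_attempts,
    do: {:backoff, attempts + 1}

  def decide(429, _attempts), do: :dead_letter

  def decide(status, _attempts) when status in [502, 503], do: :retry

  # Anything unrecognised goes to the dead-letter queue rather than looping.
  def decide(_status, _attempts), do: :dead_letter
end

for status <- [200, 401, 402, 429, 502, 418] do
  IO.puts("#{status} -> #{inspect(ErrorPolicy.decide(status))}")
end
```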
Frequently asked questions
Should I use Req or HTTPoison in a new Elixir project?
Req. It is actively maintained, ships with composable steps for retries, compression, and JSON encoding out of the box, and has a cleaner API than HTTPoison. HTTPoison wraps hackney and carries legacy design decisions, so avoid it in greenfield projects. Finch, the pooled client built on Mint that Req itself uses underneath, is a good lower-level alternative if you need connection pooling without the Req abstraction; Mint on its own is a process-less connection library with no pool.
When should I use Floki versus OmniScrape's css_extractor output format?
Use css_extractor when the fields you need are fixed and well-defined — the API does the extraction server-side and returns a structured map, so your Elixir code never touches raw HTML. Use Floki when you need to traverse complex or variable DOM structures, archive the full HTML for later reprocessing, or extract fields whose selectors vary by page type. For Broadway pipelines at high throughput, css_extractor reduces per-message CPU cost significantly.
Floki or SweetXml for parsing?
Floki for HTML scraped from websites. SweetXml is the right tool when you have genuine XML — RSS feeds, sitemaps, API responses with XML content types. Feeding tag-soup HTML into SweetXml produces unreliable results because browsers are far more lenient than XML parsers about malformed markup.
How do I choose max_concurrency for Task.async_stream against OmniScrape?
Start at 5 and watch two metrics: 429 response rates in your telemetry and the `billing.charged` field per request. If you see no 429s and billing looks expected, increment by 5 and observe again. The right ceiling depends on your OmniScrape plan's concurrency allowance and the target site's tolerance — there is no universal number. Instrument with :telemetry.execute/3 so you can graph this without changing code.
Is Phoenix LiveView useful for scraping infrastructure?
Yes, as an internal operations dashboard. LiveView's PubSub integration makes it straightforward to stream job progress, error rates, and billing spend in real time without polling. It is not appropriate as a customer-facing interface for scraped data — those pages should read from a cache or database, not trigger live scrape requests.
How do I handle pagination across many pages without blocking the supervisor?
Model pagination as a GenServer that holds a cursor (page number or next-page URL) in state. On each `handle_info(:next_page, state)` call, fetch one page, parse results, persist them, then send `self()` a `:next_page` message with an updated cursor. The GenServer processes pages sequentially per domain but multiple GenServers run concurrently under the supervisor — one per domain or job. This avoids spawning an unbounded number of tasks and keeps memory usage predictable.
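A minimal sketch of that cursor-driven loop follows. The module name `Paginator` and its `:start`, `:fetch_page`, and `:persist` options are hypothetical; the page fetcher is injected as a function returning `{:ok, items, next_cursor}` (with `nil` meaning no more pages) so the example runs without a network.

```elixir
defmodule Paginator do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      cursor: Keyword.fetch!(opts, :start),
      fetch_page: Keyword.fetch!(opts, :fetch_page),
      persist: Keyword.fetch!(opts, :persist)
    }

    send(self(), :next_page)
    {:ok, state}
  end

  # A nil cursor means the last page has been processed.
  @impl true
  def handle_info(:next_page, %{cursor: nil} = state), do: {:stop, :normal, state}

  def handle_info(:next_page, state) do
    case state.fetch_page.(state.cursor) do
      {:ok, items, next_cursor} ->
        state.persist.(items)
        # One page per message keeps the mailbox free for other messages.
        send(self(), :next_page)
        {:noreply, %{state | cursor: next_cursor}}

      {:error, reason} ->
        # Let the supervisor decide what to do with a failed crawl.
        {:stop, {:fetch_failed, reason}, state}
    end
  end
end

# Stub fetcher: three pages, then no next cursor.
fetch_page = fn
  page when page < 3 -> {:ok, ["item-#{page}"], page + 1}
  3 -> {:ok, ["item-3"], nil}
end

{:ok, _pid} =
  Paginator.start_link(
    start: 1,
    fetch_page: fetch_page,
    persist: fn items -> IO.inspect(items, label: "persisted") end
  )

# Give the demo process a moment to finish before the script exits.
Process.sleep(100)
```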
What is the difference between mode auto and js_rendering in OmniScrape?
mode: "auto" tries the fast HTTP lane first and escalates to a headless browser automatically if the response looks like a bot challenge or empty skeleton. It is the right default for most targets. mode: "js_rendering" forces headless browser execution unconditionally — use it when you already know the page requires JavaScript and you want to skip the fast-lane attempt to avoid the escalation latency. For bot-protected pages, combine mode: "auto" with enable_solver: true.