1. Node 18+, ESM modules, and project setup
Node 18 enabled a native fetch global by default (the API carried an experimental label until Node 21, but it works without flags), removing the long-standing dependency on node-fetch or cross-fetch polyfills. If you are on Node 16 or earlier, upgrade: the rest of this guide assumes the built-in fetch is available.
Set `"type": "module"` in package.json to enable top-level await and import/export syntax without a build step. If you need to mix CJS and ESM in the same package, mark individual files with the .mjs (ESM) or .cjs (CommonJS) extension. Cheerio is the only required dependency for static-page scraping; install axios only if your team already uses it or you want its interceptor API for shared client modules.
Keep your OMNISCRAPE_KEY in a .env file and load it with Node's built-in `--env-file` flag (Node 20.6+) or the dotenv package. Never commit API keys to version control.
```shell
# Minimum setup — fetch is built in on Node 18+
npm init -y
npm install cheerio

# Optional: axios for interceptors and shared client modules
npm install axios

# Optional: dotenv for Node < 20.6
npm install dotenv
```
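Once the key is loaded, it can help to fail fast when it is missing rather than discover the problem on the first API call. The sketch below assumes a hypothetical helper named `requireEnv`; it is not part of Node or of OmniScrape.

```javascript
// Hypothetical helper: read a required environment variable or fail fast.
// Typical launch on Node 20.6+:  node --env-file=.env scrape.js
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  // Trim stray whitespace that sometimes sneaks into .env files.
  return value.trim();
}

// Usage: const apiKey = requireEnv("OMNISCRAPE_KEY");
```

Passing `env` as a parameter keeps the helper easy to exercise without touching the real process environment.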
2. Fetch a page with native fetch
Start with books.toscrape.com — a sandboxed site built for scraper tutorials that never blocks you. Validating your selector logic here before pointing at a production target saves debugging time: if this returns HTML and your real target returns a challenge page, the problem is bot protection, not your code.
Set an explicit AbortSignal timeout so a hung connection does not leave your script waiting indefinitely. Thirty seconds is a reasonable ceiling for a single page fetch; lower it to ten seconds for bulk jobs where you want fast failure and retry.
Check `res.ok` before calling `res.text()`. A 403 or 503 will still resolve the fetch promise — you need to inspect the status explicitly.
```javascript
const url = "https://books.toscrape.com/catalogue/page-1.html";

const res = await fetch(url, {
  headers: {
    "User-Agent": "OmniScrapeTutorial/1.0",
    "Accept-Language": "en-US,en;q=0.9",
  },
  signal: AbortSignal.timeout(30_000),
});

// fetch resolves on 4xx/5xx too, so check the status explicitly
if (!res.ok) throw new Error(`HTTP ${res.status} — ${res.statusText}`);

const html = await res.text();
// String length counts characters (UTF-16 code units), not bytes
console.log(`Fetched ${html.length.toLocaleString()} characters`);
```
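If several scripts share this pattern, the timeout and status check can be folded into one helper. The following is a minimal sketch, not a definitive implementation: `fetchHtml` and its `fetchImpl` option are hypothetical names introduced here so the helper can be exercised without network access.

```javascript
// Sketch: fetch a page as text, with a timeout and an explicit status check.
// `fetchImpl` defaults to the global fetch and can be swapped for a fake in tests.
async function fetchHtml(url, { timeoutMs = 30_000, fetchImpl = fetch } = {}) {
  let res;
  try {
    res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
  } catch (err) {
    // AbortSignal.timeout() aborts with a DOMException named "TimeoutError";
    // "AbortError" is also checked in case a runtime surfaces it under that name.
    if (err?.name === "TimeoutError" || err?.name === "AbortError") {
      throw new Error(`Timed out after ${timeoutMs} ms: ${url}`, { cause: err });
    }
    throw err; // DNS failure, connection reset, and similar network errors
  }
  if (!res.ok) {
    throw new Error(`HTTP ${res.status} for ${url}`);
  }
  return res.text();
}

// Usage: const html = await fetchHtml("https://books.toscrape.com/catalogue/page-1.html");
```

Separating timeouts from HTTP errors in the message makes it easier to tell a slow target from a blocked one when reading logs.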
3. Parse HTML with Cheerio
Cheerio implements a jQuery-compatible selector API on the server. Load the HTML string once with `cheerio.load()`, then query with the same CSS selectors you copy from Chrome DevTools. The API is synchronous — no async/await inside the parsing step.
Cheerio does not execute JavaScript. If the page hydrates content client-side, the HTML string you pass to `cheerio.load()` contains an empty shell and your selectors return nothing. That case requires `mode: "js_rendering"` through OmniScrape — covered in the scraping JavaScript-rendered pages guide and in the js_rendering section below.
Use `.attr()` for href and src attributes, `.text().trim()` for visible text, and `.html()` when you need the inner markup of an element. Chain selectors with `.find()` to scope queries to a parent element rather than the whole document.
```javascript
import * as cheerio from "cheerio";

const $ = cheerio.load(html);
const books = [];

$("article.product_pod").each((_, el) => {
  const card = $(el);
  books.push({
    title: card.find("h3 a").attr("title"),
    price: card.find(".price_color").text().trim(),
    rating: card.find("p.star-rating").attr("class")?.split(" ")[1] ?? null,
    inStock: card.find(".instock").text().includes("In stock"),
    detailUrl: "https://books.toscrape.com/catalogue/" +
      card.find("h3 a").attr("href"),
  });
});

console.log(`Parsed ${books.length} books`);
console.log(JSON.stringify(books.slice(0, 2), null, 2));
```
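The parsed fields are still raw strings: a price such as "£51.77" and a rating word such as "Three". A post-processing step can turn them into typed values. This is a sketch under stated assumptions: `normalizeBook` and `RATING_WORDS` are hypothetical names, and the input is assumed to carry the raw `href` attribute rather than the pre-joined `detailUrl` used above.

```javascript
// Sketch: normalize the raw strings produced by the parsing step.
// books.toscrape.com encodes ratings as the class words One through Five.
const RATING_WORDS = { One: 1, Two: 2, Three: 3, Four: 4, Five: 5 };

function normalizeBook(raw, pageUrl) {
  // "£51.77" -> 51.77: drop everything except digits and the decimal point.
  const price = Number.parseFloat(String(raw.price ?? "").replace(/[^0-9.]/g, ""));
  return {
    title: raw.title ?? null,
    price: Number.isNaN(price) ? null : price,
    rating: RATING_WORDS[raw.rating] ?? null,
    inStock: Boolean(raw.inStock),
    // Resolve a relative href against the URL of the page it was found on.
    detailUrl: raw.href ? new URL(raw.href, pageUrl).href : null,
  };
}
```

Resolving with `new URL(href, pageUrl)` avoids hard-coding a path prefix, which matters if the same parser later runs on pages at a different depth of the site.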
4. axios when you want interceptors
Native fetch is sufficient for standalone scripts. axios becomes worthwhile when you are building a shared scraping client used across multiple services in a monorepo: request interceptors let you attach the API key centrally, response interceptors let you log billing data and throw typed errors, and the instance pattern keeps configuration out of individual call sites.
axios also handles JSON serialization and deserialization automatically — you pass a plain object to `data` and receive a parsed object in `response.data` without calling `.json()` manually. The timeout is in milliseconds and applies to the full request lifecycle, not just the connection.
Set the timeout to at least 120 seconds for `js_rendering` requests — browser renders can take 20–30 seconds on complex SPAs before the wait selector appears.
```javascript
import axios from "axios";
import * as cheerio from "cheerio";

const omniscrape = axios.create({
  baseURL: "https://api.omniscrape.io",
  timeout: 120_000,
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": process.env.OMNISCRAPE_KEY,
  },
});

// Log billing on every response
omniscrape.interceptors.response.use((res) => {
  const b = res.data?.billing;
  if (b) console.log(`[billing] charged=${b.charged} balance=${b.balance_after}`);
  return res;
});

const { data } = await omniscrape.post("/v1/scrape", {
  url: "https://protected-shop.com/deals",
  mode: "auto",
  output_format: "html",
  enable_solver: true,
});

if (!data.success) throw new Error(`Scrape failed: ${JSON.stringify(data)}`);

const $ = cheerio.load(data.data.content);
console.log($("h1").first().text().trim());
```
5. Batch URLs with Promise.allSettled
Node's event loop handles concurrent I/O efficiently. You can fire multiple OmniScrape requests in parallel, but uncapped concurrency causes problems: ten simultaneous `js_rendering` renders consume browser slots quickly, inflate costs, and trigger 429 rate-limit responses. A simple pool pattern — spawn N workers that each pull from a shared queue — keeps in-flight work bounded without a full queue library.
Use `Promise.allSettled` rather than `Promise.all` so a single failed URL does not abort the entire batch. Inspect each result's `status` field to separate successes from failures and dead-letter the bad URLs for review.
For bulk jobs with hundreds of URLs, prefer `output_format: "css_extractor"` with `css_selectors` to avoid loading Cheerio at all — the API extracts fields server-side and returns a structured object directly.
```javascript
const URLS = [
  "https://example.com/product/1",
  "https://example.com/product/2",
  "https://example.com/product/3",
  "https://example.com/product/4",
  "https://example.com/product/5",
];

async function scrapeOne(url) {
  const res = await fetch("https://api.omniscrape.io/v1/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": process.env.OMNISCRAPE_KEY,
    },
    body: JSON.stringify({
      url,
      mode: "auto",
      output_format: "css_extractor",
      css_selectors: {
        title: "h1",
        price: "[data-price], .price, .product-price",
        sku: "[data-sku]",
      },
    }),
    signal: AbortSignal.timeout(60_000),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const body = await res.json();
  if (!body.success) throw new Error(`API error for ${url}: ${JSON.stringify(body)}`);
  return { url, fields: body.data.css_extracted, method: body.metadata.method_used };
}

// Pool: limit concurrent in-flight requests.
// Each item's outcome is stored in Promise.allSettled's result shape, so a
// failed URL neither stops its worker nor loses its error.
async function mapPool(items, fn, limit = 5) {
  const results = new Array(items.length);
  let cursor = 0;
  async function worker() {
    while (cursor < items.length) {
      const idx = cursor++;
      const [outcome] = await Promise.allSettled([fn(items[idx])]);
      results[idx] = outcome;
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}

const settled = await mapPool(URLS, scrapeOne, 5);

for (const [i, result] of settled.entries()) {
  if (result.status === "fulfilled") {
    console.log(`✓ ${result.value.url}`, result.value.fields);
  } else {
    console.error(`✗ ${URLS[i]}`, result.reason?.message);
  }
}
```
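To dead-letter the failures rather than only log them, the settled results can be split into two lists. A minimal sketch, assuming a hypothetical helper named `partitionSettled`; the dead-letter record shape is illustrative, not an OmniScrape format.

```javascript
// Sketch: split Promise.allSettled()-shaped results into successes and dead letters.
// urls[i] must correspond to settled[i].
function partitionSettled(urls, settled) {
  const successes = [];
  const deadLetters = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      successes.push(result.value);
    } else {
      deadLetters.push({
        url: urls[i],
        error: result.reason?.message ?? String(result.reason),
        failedAt: new Date().toISOString(),
      });
    }
  });
  return { successes, deadLetters };
}

// Usage with the batch above:
// const { successes, deadLetters } = partitionSettled(URLS, settled);
```

Writing `deadLetters` to a file or table gives you a reviewable list of URLs to re-run once the cause is understood.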
6. When fetch gets blocked
Retailers, travel aggregators, financial data sites, and anything behind Cloudflare, Akamai, or DataDome will return 403 responses, redirect loops, or challenge HTML to any request originating from a datacenter IP range. Rotating User-Agent strings and adding Accept-Language headers may bypass the simplest checks but will not defeat TLS fingerprinting or behavioral analysis — and maintaining that bypass stack is an ongoing engineering cost.
The practical decision point: if a target consistently blocks raw fetch and you do not want to maintain browser automation, proxy rotation, and fingerprint patching yourself, delegate those requests to OmniScrape. The swap is a single function change — POST the URL to the API, read `data.content`, pass it to Cheerio. Your selector logic is untouched. See Cloudflare bypass for a detailed breakdown of what each protection layer does.
Set `enable_solver: true` alongside `mode: "auto"` for pages that serve CAPTCHA challenges. The API handles challenge solving and retries automatically; `metadata.challenge_solved` in the response confirms whether a challenge was encountered.
7. OmniScrape fetch with mode auto
POST to `https://api.omniscrape.io/v1/scrape` with `X-API-Key` in the request header. `mode: "auto"` tries the fast HTTP path first and escalates to a real headless browser only when the server signals bot detection. This keeps costs low for pages that do not require JavaScript — you pay browser-render rates only when necessary.
The response HTML is in `body.data.content`. Check `body.metadata.method_used` to see whether the request was served by the fast lane or a browser render. This field is useful for cost attribution and for diagnosing why a particular URL is consistently escalating to `js_rendering`.
Wrap the fetch in a try/catch and check both the HTTP status and `body.success`. A 200 response with `success: false` means the API reached the target but could not extract content — log the full response body and dead-letter the URL rather than retrying immediately.
```javascript
import * as cheerio from "cheerio";

const resp = await fetch("https://api.omniscrape.io/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": process.env.OMNISCRAPE_KEY,
  },
  body: JSON.stringify({
    url: "https://protected-shop.com/product/8821",
    mode: "auto",
    output_format: "html",
    enable_solver: true,
  }),
  signal: AbortSignal.timeout(90_000),
});

if (!resp.ok) throw new Error(`OmniScrape HTTP ${resp.status}`);

const body = await resp.json();
if (!body.success) throw new Error(`Scrape failed: ${JSON.stringify(body)}`);

const $ = cheerio.load(body.data.content);
const price = $(".product-price, [data-price]").first().text().trim();
const title = $("h1").first().text().trim();

console.log({ title, price });
console.log(`Method: ${body.metadata.method_used}`);
console.log(`Solver used: ${body.metadata.solver_used}`);
console.log(`Charged: ${body.billing.charged}, balance: ${body.billing.balance_after}`);
```
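The try/catch advice can be sketched as a wrapper that never throws and instead reports which layer failed. This is one possible shape, not the API's own client: `scrapeHtml`, the `fetchImpl` option, and the `kind` labels are hypothetical names introduced for illustration; the endpoint, header, and response fields are the ones used in this guide.

```javascript
// Sketch: one call to the scrape endpoint that returns a result object
// instead of throwing, so callers can branch on what went wrong.
async function scrapeHtml(url, { apiKey = process.env.OMNISCRAPE_KEY, fetchImpl = fetch, timeoutMs = 90_000 } = {}) {
  let resp;
  try {
    resp = await fetchImpl("https://api.omniscrape.io/v1/scrape", {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": apiKey },
      body: JSON.stringify({ url, mode: "auto", output_format: "html" }),
      signal: AbortSignal.timeout(timeoutMs),
    });
  } catch (err) {
    // Timeout, DNS failure, connection reset: the request never completed.
    return { ok: false, kind: "network", url, detail: err.message };
  }
  if (!resp.ok) {
    return { ok: false, kind: "http", url, status: resp.status };
  }
  const body = await resp.json().catch(() => null);
  if (!body?.success) {
    // 200 with success:false (or an unparseable body): keep it for review.
    return { ok: false, kind: "api", url, detail: body };
  }
  return { ok: true, url, html: body.data.content, method: body.metadata?.method_used };
}
```

A caller would typically dead-letter `kind: "api"` results, route `kind: "http"` through status-specific handling, and treat `kind: "network"` as a retry candidate.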
8. js_rendering for SPAs and lazy-loaded content
React, Vue, and Next.js storefronts often ship a minimal HTML shell and populate product data after the JavaScript bundle executes. Sending that shell to Cheerio returns empty strings for every price and title selector. `mode: "js_rendering"` launches a real headless browser, executes JavaScript, and waits for the DOM to settle before returning HTML.
Use `js_wait_selector` to specify a CSS selector that must appear in the DOM before the page is considered ready. This is more reliable than a fixed `js_wait_timeout` because it adapts to actual render time rather than a worst-case estimate. Set `js_wait_timeout` as a ceiling — the API returns whatever is rendered if the selector does not appear within the timeout.
Reserve `js_rendering` for URLs that actually need it. For sites where only some pages are SPAs, use `mode: "auto"` — it escalates automatically and avoids paying browser-render rates for static pages.
```javascript
import * as cheerio from "cheerio";

const resp = await fetch("https://api.omniscrape.io/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-API-Key": process.env.OMNISCRAPE_KEY,
  },
  body: JSON.stringify({
    url: "https://spa-store.com/catalog",
    mode: "js_rendering",
    output_format: "html",
    js_wait_selector: ".product-card",
    js_wait_timeout: 15000,
  }),
  signal: AbortSignal.timeout(120_000),
});

if (!resp.ok) throw new Error(`OmniScrape HTTP ${resp.status}`);

const body = await resp.json();
if (!body.success) throw new Error(`Render failed: ${JSON.stringify(body)}`);

// body.data.content is the fully rendered HTML
const $ = cheerio.load(body.data.content);
const products = [];

$(".product-card").each((_, el) => {
  products.push({
    name: $(el).find(".product-name").text().trim(),
    price: $(el).find(".product-price").text().trim(),
    id: $(el).attr("data-product-id"),
  });
});

console.log(`Extracted ${products.length} products from SPA`);
```
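When many targets need rendering, the payload and the client-side timeout can be built together so the local timeout always outlasts the render wait. A rough sketch: `buildRenderRequest` is a hypothetical helper, and the 30-second margin and 120-second floor are assumptions chosen to match the timeout advice earlier in this guide, not documented API values.

```javascript
// Sketch: build a js_rendering payload plus a client timeout that outlasts the wait ceiling.
function buildRenderRequest(url, { waitSelector, waitTimeoutMs = 15_000, marginMs = 30_000 } = {}) {
  const payload = {
    url,
    mode: "js_rendering",
    output_format: "html",
    js_wait_timeout: waitTimeoutMs,
  };
  // Only send a wait selector when the caller names one.
  if (waitSelector) payload.js_wait_selector = waitSelector;
  return {
    payload,
    // Never drop below 120 s; grow past it if the render ceiling is unusually long.
    clientTimeoutMs: Math.max(120_000, waitTimeoutMs + marginMs),
  };
}

// Usage:
// const { payload, clientTimeoutMs } = buildRenderRequest(url, { waitSelector: ".product-card" });
// fetch(endpoint, { method: "POST", headers, body: JSON.stringify(payload),
//                   signal: AbortSignal.timeout(clientTimeoutMs) });
```

Because the API returns whatever has rendered when the wait ceiling expires, it is still worth checking that the parsed result is non-empty before persisting it.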
9. Running scrapers outside request handlers
Do not call OmniScrape from a Next.js API route or Express handler that a user is waiting on. Scrape latency is measured in seconds — sometimes tens of seconds for JavaScript-heavy pages — not the milliseconds users expect from a web response. Coupling scraping to request handlers also means a slow or failed scrape degrades your user-facing API.
The correct architecture separates scraping from serving: a scheduled job or background worker fetches and stores data; your web application reads from the database. On Vercel, use Vercel Cron Jobs. On a Node server, BullMQ with a Redis backend gives you retries, dead-letter queues, and concurrency control out of the box. A plain `setInterval` works for low-frequency jobs in a dedicated process.
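For the plain `setInterval` case, one thing worth guarding against is a run that takes longer than the interval and overlaps the next one. The sketch below assumes a hypothetical `makeTick` wrapper and a `job` function that you supply; the interval in the usage comment is illustrative.

```javascript
// Sketch: wrap a scrape job so a tick is skipped while the previous run is still in flight.
function makeTick(job, log = console) {
  let running = false;
  return async function tick() {
    if (running) {
      log.warn("previous run still in progress; skipping this tick");
      return "skipped";
    }
    running = true;
    try {
      await job();
      return "ran";
    } catch (err) {
      // Catch here so one failed run does not crash the worker process.
      log.error("scrape job failed:", err.message);
      return "failed";
    } finally {
      running = false;
    }
  };
}

// Usage in a dedicated worker process:
// const tick = makeTick(runScrapeJob);   // runScrapeJob is your own async function
// setInterval(tick, 15 * 60 * 1000);
```

This guard only works within a single process; if several workers run the same job, a queue such as BullMQ is the better fit.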
Operational checklist for production scraping workers:
- Store OMNISCRAPE_KEY in environment secrets — never in client bundles or committed config files
- Log billing.charged and billing.balance_after per job for per-target cost attribution
- Use output_format css_extractor in workers when the set of fields is fixed — skip Cheerio entirely
- Retry 429, 502, and 503 responses with exponential backoff and jitter; cap at 3 attempts
- Never retry 401 (bad key) or 402 (balance exhausted) — alert and pause the job queue
- Dead-letter 200 + success:false responses for manual review rather than retrying blindly
- Emit structured logs (JSON) so billing and error data is queryable in your observability stack
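The retry item in the checklist can be sketched as a small wrapper. This is one way to do it under stated assumptions: `withRetry`, `backoffDelay`, and the `isRetryable` option are hypothetical names, the 2-second base follows the status-code guidance in this guide, and the one-second jitter window is an arbitrary choice.

```javascript
// Sketch: exponential backoff with jitter, capped at 3 attempts.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function backoffDelay(attempt, baseMs = 2_000, random = Math.random) {
  // attempt 0 -> about 2 s, attempt 1 -> about 4 s, attempt 2 -> about 8 s,
  // each with up to one extra second of random jitter.
  return baseMs * 2 ** attempt + Math.floor(random() * 1_000);
}

async function withRetry(fn, { attempts = 3, isRetryable = () => true, wait = sleep } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      const lastAttempt = attempt === attempts - 1;
      // Stop immediately on permanent errors such as 401 or 402.
      if (lastAttempt || !isRetryable(err)) break;
      await wait(backoffDelay(attempt));
    }
  }
  throw lastError;
}

// Usage, assuming your scrape function attaches a `status` to the errors it throws:
// await withRetry(() => scrape(url), { isRetryable: (e) => [429, 502, 503].includes(e.status) });
```

The jitter spreads retries from parallel workers so they do not all hit the API again at the same instant.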
10. Status codes and error handling
OmniScrape HTTP status codes map to distinct failure modes that require different handling strategies. Treating all non-200 responses the same way leads to wasted retries on permanent errors and missed retries on transient ones.
Handle each code explicitly in your retry logic:
- 200 + success:true — parse data.content or data.css_extracted and persist to your store
- 200 + success:false — the API reached the target but extraction failed; log body, dead-letter the URL, do not retry immediately
- 400 — malformed request body; fix the payload, do not retry
- 401 — invalid or missing API key; fix the environment variable, do not retry
- 402 — account balance exhausted; alert your billing contact, pause the job queue
- 429 — rate limit hit; back off exponentially starting at 2 seconds, reduce pool concurrency
- 502 / 503 — transient infrastructure issue; retry up to 3 times with jitter before dead-lettering
- Timeout (a TimeoutError when the signal comes from AbortSignal.timeout, an AbortError for a manual abort) — increase signal timeout for js_rendering targets; for fast targets, dead-letter after 2 retries
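The list above translates fairly directly into a lookup function. A minimal sketch: `classifyOutcome` and the action names it returns are labels invented here, not values defined by the API, and the fallback for unlisted codes is an assumption.

```javascript
// Sketch: map an HTTP status and parsed body to the handling described above.
function classifyOutcome(status, body) {
  if (status === 200) return body?.success ? "persist" : "dead_letter";
  if (status === 400) return "fix_request";
  if (status === 401) return "fix_api_key";
  if (status === 402) return "pause_queue";
  if (status === 429 || status === 502 || status === 503) return "retry_with_backoff";
  // Anything not listed: send for manual review rather than guess.
  return "dead_letter";
}

// Usage:
// switch (classifyOutcome(resp.status, body)) { case "retry_with_backoff": /* ... */ }
```

Keeping this mapping in one function means the retry policy changes in a single place when a new status code needs handling.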
Frequently asked questions
Should I use native fetch or axios for OmniScrape requests?
Native fetch works well for standalone scripts and has zero dependencies on Node 18+. Choose axios when you are building a shared scraping client across a monorepo — its interceptor API lets you attach the API key, log billing data, and throw typed errors in one place rather than repeating that logic in every call site. The OmniScrape API accepts standard JSON over HTTP, so both clients work identically at the protocol level.
Can I use Puppeteer or Playwright instead of OmniScrape?
Yes, and both are good choices for scraping pages you fully control — internal dashboards, authenticated flows, or sites that never block you. For public pages behind Cloudflare, Akamai, or DataDome, you inherit the maintenance burden of browser fingerprint patching, proxy rotation, and CAPTCHA solving. That stack requires ongoing attention as protection services update their detection. OmniScrape handles that layer so you can focus on extraction logic.
Does Cheerio execute JavaScript?
No. Cheerio is a server-side HTML parser — it processes the HTML string you give it and does nothing more. If a page populates its content after JavaScript execution (React, Vue, Angular SPAs), the HTML string you fetch directly will contain an empty shell. Pass those URLs to OmniScrape with mode: "js_rendering" and js_wait_selector, then load the returned data.content into Cheerio as normal.
What concurrency limit should I use for bulk scraping jobs?
Five to ten concurrent requests is a reasonable default for mode: "auto" jobs. Drop to three to five for mode: "js_rendering" — browser renders are more resource-intensive and consume API capacity faster. Monitor 429 responses: if you see them, lower the pool limit and add jitter between batches. For very large jobs (thousands of URLs), consider splitting across multiple scheduled runs rather than one large burst.
How do I handle pages that require login or session cookies?
Use the custom_headers field to pass Cookie or Authorization headers to OmniScrape. For multi-step authenticated flows, use session_id to maintain a browser session across requests — the API reuses the same browser context, preserving cookies set during login. Avoid storing credentials in source code; pull them from environment secrets at runtime.
Is TypeScript worth using for scraping projects?
Yes, especially for anything beyond a one-off script. Defining interfaces for OmniScrape response shapes (success, data.content, data.css_extracted, billing, metadata) catches access errors at compile time rather than runtime. It also makes refactoring safer when you change selectors or add new fields. The compiled output is standard JavaScript — no runtime overhead.
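If you stay in plain JavaScript, a similar safety net is available through JSDoc typedefs, which editors and `tsc --checkJs` can verify. The sketch below is illustrative only: the field types are assumptions inferred from the examples in this guide, not an official schema, and `isScrapeResponse` is a hypothetical guard.

```javascript
/**
 * Response shape as used in this guide (types are assumptions).
 * @typedef {object} ScrapeResponse
 * @property {boolean} success
 * @property {{ content?: string, css_extracted?: Record<string, unknown> }} [data]
 * @property {{ charged: number, balance_after: number }} [billing]
 * @property {{ method_used?: string }} [metadata]
 */

/**
 * Runtime guard to pair with the typedef: narrows an unknown JSON value.
 * @param {unknown} body
 * @returns {body is ScrapeResponse}
 */
function isScrapeResponse(body) {
  return typeof body === "object" && body !== null && typeof body.success === "boolean";
}

// Usage:
// const body = await resp.json();
// if (!isScrapeResponse(body)) throw new Error("Unexpected response shape");
```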
What is the difference between output_format html and css_extractor?
html returns the full page HTML in data.content, which you then parse with Cheerio. css_extractor performs the field extraction server-side using the css_selectors map you provide, returning a structured object in data.css_extracted. Use css_extractor in production workers where the fields are fixed — it eliminates the Cheerio dependency, reduces payload size, and simplifies the parsing step to a direct property access.