1. When to Use Puppeteer
Puppeteer is the right tool when you need tight Chrome integration in a Node.js or TypeScript codebase. It exposes a large share of the Chrome DevTools Protocol (CDP): you can intercept and modify network requests and responses, emulate CPU throttling and network conditions, capture full-page screenshots, and generate PDFs with custom headers and footers, all without leaving JavaScript.
Teams with existing Puppeteer scripts benefit from the large Stack Overflow archive and the puppeteer-extra plugin ecosystem. If your pipeline already produces PDFs or screenshots, migrating to a different library is rarely worth the cost. Where Puppeteer struggles is on hardened bot-protected pages; for those, the patterns below delegate the risky fetch to OmniScrape while keeping your existing locator and parsing code intact.
- Node.js or TypeScript services already running Puppeteer 21+
- Screenshot or PDF export pipelines requiring Chrome fidelity
- CDP-level network interception and response mocking
- Scripted multi-step navigation against a BaaS remote browser
- Migrating legacy headless Chrome scripts to a managed browser endpoint
- Generating print-ready PDFs from server-rendered HTML
2. Where Puppeteer Breaks
The core problem is that the patches in puppeteer-extra-plugin-stealth are public. Bot-detection vendors read the same npm registry you do, and they ship fingerprint updates continuously. A patch that works today (spoofing `navigator.webdriver`, faking Chrome's `plugins` array, randomising canvas noise) may be detected within days of a vendor update. You are in a reactive arms race with no structural advantage.
Beyond stealth, operating local Chrome at scale has real infrastructure costs. Each Chromium instance consumes 150–300 MB of RAM under normal conditions, so 20 concurrent tabs at that rate need roughly 3–6 GB, more than small cloud instance types provide. puppeteer-cluster helps distribute load across workers, but you still own the grid: health checks, crash recovery, Chromium revision pinning, and proxy rotation all fall to your team.
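To make the memory figure concrete, here is a back-of-the-envelope sketch. It only multiplies the 150–300 MB per-instance range quoted above by the number of concurrent instances; the function names and the 4096 MB budget are illustrative assumptions, and real usage varies with page weight.

```typescript
// Rough capacity estimate using the 150–300 MB per-instance range quoted above.
// Names and defaults are illustrative, not measured values.
interface MemoryEstimateMb {
  lowMb: number;
  highMb: number;
}

function estimateChromiumMemoryMb(
  concurrentInstances: number,
  perInstanceMb: { low: number; high: number } = { low: 150, high: 300 },
): MemoryEstimateMb {
  if (!Number.isInteger(concurrentInstances) || concurrentInstances < 0) {
    throw new RangeError('concurrentInstances must be a non-negative integer');
  }
  return {
    lowMb: concurrentInstances * perInstanceMb.low,
    highMb: concurrentInstances * perInstanceMb.high,
  };
}

// How many instances fit in a budget at the pessimistic end of the range.
function maxInstancesForBudget(budgetMb: number, perInstanceHighMb = 300): number {
  return Math.max(0, Math.floor(budgetMb / perInstanceHighMb));
}

console.log(estimateChromiumMemoryMb(20)); // { lowMb: 3000, highMb: 6000 }
console.log(maxInstancesForBudget(4096)); // 13
```

Sizing against the high end of the range leaves headroom for the operating system and for pages that are heavier than average.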
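Limited browser coverage is also a practical constraint. Puppeteer is built around Chromium; recent releases can also drive Firefox, but there is no WebKit path, and most of the surrounding ecosystem (stealth plugins, puppeteer-cluster recipes, CDP-specific features) assumes Chrome. For sites that behave differently across browsers, or for fingerprint diversity, you need a different tool or a managed browser endpoint.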
- Cloudflare Bot Management, in both new headless (`headless: 'new'`) and legacy headless modes
- DataDome behavioral scoring and mouse-movement analysis
- PerimeterX / HUMAN fingerprint checks at TLS and HTTP/2 layers
- Memory pressure running 20+ tabs per VM
- puppeteer-extra-plugin-stealth detection after vendor fingerprint updates
- Limited browser diversity: Chromium-first, partial Firefox support, no WebKit
- Grid health and crash recovery overhead in self-hosted cluster mode
3. Pattern A — Fetch with OmniScrape API, Parse in Node
Pattern A keeps your parsing logic in Node.js but replaces the raw `fetch` or `page.goto` call with a POST to the OmniScrape API. OmniScrape handles TLS fingerprinting, proxy selection, and challenge solving server-side, then returns clean HTML. You load that HTML into Cheerio for fast extraction, or into a Puppeteer page via `setContent` if you want to reuse existing `page.$eval` locator code.
This pattern is stateless from your side — each request is an independent HTTP POST with no persistent browser process. It scales well in serverless environments (Lambda, Cloud Functions, Cloudflare Workers) because there is no local Chrome to launch, no memory cliff, and no cold-start penalty from spinning up a browser. For high-volume catalog scraping, Pattern A is almost always the right choice.
Set `mode: 'auto'` to let OmniScrape try the fast HTTP lane first and escalate to a headless browser only when the page requires JavaScript execution. Add `enable_solver: true` for Cloudflare-protected targets. The response body follows a consistent shape: `body.success` is a boolean, rendered HTML is at `body.data.content`, and `body.metadata.method_used` tells you whether the fast or JS-rendering path was used — useful for debugging and cost attribution.
import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

interface OmniScrapeResponse {
  success: boolean;
  data: {
    content: string;
    css_extracted?: Record<string, string>;
  };
  metadata: {
    method_used: 'fast' | 'js_rendering';
    solver_used: boolean;
    challenge_solved: boolean;
  };
  billing: {
    charged: number;
    balance_after: number;
  };
  error?: string;
}

async function fetchProtected(url: string): Promise<string> {
  const response = await fetch('https://api.omniscrape.io/v1/scrape', {
    method: 'POST',
    headers: {
      'X-API-Key': process.env.OMNISCRAPE_KEY!,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url,
      mode: 'auto',
      output_format: 'html',
      enable_solver: true,
    }),
  });
  const body = (await response.json()) as OmniScrapeResponse;
  if (!body.success) {
    throw new Error(`OmniScrape error: ${body.error ?? JSON.stringify(body)}`);
  }
  console.log('method_used:', body.metadata.method_used);
  console.log('solver_used:', body.metadata.solver_used);
  return body.data.content;
}

// Cheerio extraction: fast, no browser process
const html = await fetchProtected('https://shop.example/products/1');
const $ = cheerio.load(html);
const price = $('.price').first().text().trim();
const title = $('h1.product-title').text().trim();
console.log({ title, price });
4. Pattern A — Reuse Puppeteer Locators via setContent
If your codebase already has a library of `page.$eval` and `page.$$eval` selectors, you can reuse them without rewriting to Cheerio. After fetching clean HTML from OmniScrape, create a Puppeteer page, call `page.setContent()` to inject the HTML, then run your existing locator code. The page never navigates to the target site; it parses the HTML you provide.
This is a useful migration path: swap `page.goto()` for the OmniScrape fetch + `setContent` combination, keep everything else identical. Once the migration is stable, you can gradually replace `page.$eval` calls with Cheerio for better performance, since Cheerio operates on a static DOM without a browser process.
One caveat: `setContent` loads the HTML into a live browser page, so inline scripts still run and referenced resources are still requested unless you disable JavaScript with `page.setJavaScriptEnabled(false)` and block requests through request interception. The page also has no base URL of its own, so relative links and site-specific XHR calls will not behave as they do on the live site. If your locator logic depends on state produced by user interaction or by requests made in the site's own session, you need Pattern B instead.
import puppeteer, { type Browser } from 'puppeteer';

// fetchProtected from Pattern A above
const html = await fetchProtected('https://shop.example/products/1');

let browser: Browser | null = null;
try {
  browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Inject OmniScrape HTML; the browser never navigates to the target site
  await page.setContent(html, { waitUntil: 'domcontentloaded' });
  // Reuse existing locator code unchanged
  const price = await page.$eval('.price', (el) => el.textContent?.trim() ?? '');
  const specs = await page.$$eval('.spec-row', (rows) =>
    rows.map((r) => ({
      label: r.querySelector('.label')?.textContent?.trim(),
      value: r.querySelector('.value')?.textContent?.trim(),
    }))
  );
  console.log({ price, specs });
} finally {
  await browser?.close();
}
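Because the injected HTML is parsed by a real browser page, it can still run scripts and request subresources. The following is a minimal sketch of one way to keep the snapshot inert, assuming Puppeteer's `setJavaScriptEnabled`, `setRequestInterception`, and `request.abort()` behave as documented in your version. The helper name `setStaticContent` and the `StaticPageLike` interface are hypothetical; the interface is a structural subset of `Page` so the helper can be exercised with a fake, and you can type the parameter as `Page` directly if your typings disagree.

```typescript
// Structural subset of Puppeteer's Page used by this sketch (hypothetical type).
interface InterceptedRequestLike {
  abort(): Promise<void>;
}

interface StaticPageLike {
  setJavaScriptEnabled(enabled: boolean): Promise<void>;
  setRequestInterception(enabled: boolean): Promise<void>;
  on(event: 'request', handler: (req: InterceptedRequestLike) => void): unknown;
  setContent(html: string, options?: { waitUntil?: 'domcontentloaded' }): Promise<void>;
}

// Hypothetical helper: load fetched HTML as an inert document.
async function setStaticContent(page: StaticPageLike, html: string): Promise<void> {
  // Keep inline and external scripts from mutating the snapshot.
  await page.setJavaScriptEnabled(false);
  // Abort every subresource request (images, CSS, fonts, XHR).
  await page.setRequestInterception(true);
  page.on('request', (req) => {
    void req.abort().catch(() => undefined);
  });
  await page.setContent(html, { waitUntil: 'domcontentloaded' });
}
```

In the example above, the call `page.setContent(html, ...)` would become `setStaticContent(page, html)`. Selectors that rely on CSS-driven layout are unaffected, since `$eval` reads the DOM rather than rendered pixels.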
5. Pattern B — puppeteer.connect to OmniScrape Remote Browser
Pattern B replaces `puppeteer.launch()` with `puppeteer.connect()` pointed at OmniScrape's WebSocket browser endpoint. The remote browser runs in OmniScrape's infrastructure with anti-detection patches, residential proxy rotation, and challenge solving applied at the infrastructure layer — none of that is your code's responsibility.
Use Pattern B when you need genuine browser interaction: clicking through pagination, filling login forms, waiting for XHR responses after user events, or navigating multi-step checkout flows. The full Puppeteer API is available — `page.click()`, `page.type()`, `page.waitForSelector()`, `page.evaluate()` — because you are controlling a real remote Chromium instance over CDP.
Always call `browser.disconnect()` when your session is complete. Unlike `browser.close()`, `disconnect()` ends your CDP connection without terminating the remote browser process, which is the correct cleanup for a managed endpoint. Failing to disconnect leaves the session open and continues billing. Wrap the session in a `try/finally` block to guarantee cleanup even on errors.
Prefer `waitForSelector` over `waitForNavigation` with `networkidle2`. Analytics-heavy pages fire network requests continuously, making `networkidle2` unreliable and slow. Waiting for a specific data element to appear is more deterministic and faster.
import puppeteer from 'puppeteer';

const WS_ENDPOINT =
  `wss://browser.omniscrape.io?apikey=${process.env.OMNISCRAPE_KEY}&render_media=false`;

async function scrapeWithInteraction(url: string) {
  const browser = await puppeteer.connect({
    browserWSEndpoint: WS_ENDPOINT,
  });
  try {
    const page = await browser.newPage();
    // Log the session fragment for support correlation on failures
    const sessionId = new URL(browser.wsEndpoint()).searchParams.get('sessionId');
    console.log('BaaS session:', sessionId);
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30_000 });
    // Prefer waitForSelector over networkidle2 on analytics-heavy pages
    await page.waitForSelector('.data-row', { timeout: 15_000 });
    // Paginate through results
    const allRows: string[] = [];
    let hasNext = true;
    while (hasNext) {
      const rows = await page.$$eval('.data-row', (els) =>
        els.map((e) => (e as HTMLElement).innerText.trim())
      );
      allRows.push(...rows);
      const nextBtn = await page.$('a.next-page:not([disabled])');
      if (nextBtn) {
        await nextBtn.click();
        // '.data-row' already matches the current page, so wait until the
        // first row changes (assumes consecutive pages start with different rows)
        await page.waitForFunction(
          (prev) =>
            (document.querySelector('.data-row') as HTMLElement | null)?.innerText.trim() !== prev,
          { timeout: 10_000 },
          rows[0]
        );
      } else {
        hasNext = false;
      }
    }
    return allRows;
  } finally {
    // disconnect(), not close(): ends the CDP session without killing the remote browser
    await browser.disconnect();
  }
}

const rows = await scrapeWithInteraction('https://protected.example/listings');
console.log(`Collected ${rows.length} rows`);
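The endpoint string above interpolates the API key directly into the URL. As a small sketch, assuming the same `apikey` and `render_media` query parameters, the URL can be built with proper encoding and a redacted copy produced for logging. The helper names `buildWsEndpoint` and `redactWsEndpoint` are illustrative and not part of any SDK.

```typescript
const BROWSER_WS_BASE = 'wss://browser.omniscrape.io';

// Build the WebSocket endpoint with percent-encoded query parameters.
function buildWsEndpoint(apiKey: string, renderMedia = false): string {
  if (apiKey.length === 0) throw new Error('API key is empty');
  const endpoint = new URL(BROWSER_WS_BASE);
  endpoint.searchParams.set('apikey', apiKey);
  endpoint.searchParams.set('render_media', String(renderMedia));
  return endpoint.toString();
}

// Return a copy of the endpoint that is safe to write to logs.
function redactWsEndpoint(wsEndpoint: string): string {
  const endpoint = new URL(wsEndpoint);
  if (endpoint.searchParams.has('apikey')) {
    endpoint.searchParams.set('apikey', 'REDACTED');
  }
  return endpoint.toString();
}
```

Passing `redactWsEndpoint(browser.wsEndpoint())` to your logger keeps any session identifier in the URL visible while keeping the key out of log storage.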
6. Why Stealth Plugins Fail as a Primary Strategy
puppeteer-extra-plugin-stealth works by patching a known set of headless Chrome detection vectors: `navigator.webdriver`, the `plugins` and `mimeTypes` arrays, `window.chrome` presence, WebGL renderer strings, and a handful of others. These patches are open source and well-documented. Bot-detection vendors read the same code and build detectors for the patches themselves: the artifacts a patch leaves behind, such as a recognisable canvas-noise signature, can be a detection signal in their own right on some platforms.
The result is a reactive maintenance cycle. When a detection vendor ships an update, stealth-protected scrapers break. You patch. They detect the patch. The cycle repeats on their schedule. For low-security targets — sites with no active bot management — stealth plugins provide a reasonable baseline. For Cloudflare Bot Management, DataDome, or PerimeterX, they are insufficient as a standalone strategy.
The practical approach is to use stealth plugins as a minor layer of noise on top of a structural solution, not as the solution itself. For fetch-only tasks, Pattern A routes around the problem entirely. For interactive sessions, Pattern B moves anti-detection to OmniScrape's infrastructure where it is maintained continuously and not visible in your public codebase.
7. Choosing Between Pattern A and Pattern B
The decision tree is straightforward: if your task requires browser interaction — clicks, form submission, navigation events, waiting for XHR triggered by user actions — use Pattern B. For everything else, Pattern A is faster, cheaper, and simpler to operate. Static HTML extraction, even from JavaScript-rendered pages, does not require a persistent browser session.
Volume is a secondary factor. Pattern A scales horizontally with no state — each request is an independent HTTP POST. Pattern B sessions are stateful and billed by duration; running hundreds of concurrent sessions requires careful session lifecycle management. For high-volume catalog scraping where pages are independent, Pattern A with `css_extractor` output format is the most efficient path: server-side CSS extraction returns only the fields you need, reducing response size and parse time.
- Static HTML after unlock → Pattern A + Cheerio
- Existing `page.$$eval` locator code → Pattern A + setContent (migrate gradually)
- JavaScript-rendered content, no interaction needed → Pattern A with `mode: 'js_rendering'`
- Login flows, form submission, pagination clicks → Pattern B
- High-volume catalog with stable selectors → Pattern A with `output_format: 'css_extractor'`
- Multi-step checkout or session-dependent state → Pattern B with session_id
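The decision list above can be sketched as a small function. The `TaskProfile` and `ScrapePlan` types are hypothetical, and the precedence among the Pattern A variants (server-side extraction first, then `setContent`, then Cheerio) is an assumption of this sketch rather than something the list prescribes.

```typescript
// Hypothetical description of a scraping task.
interface TaskProfile {
  needsInteraction: boolean;          // clicks, form submission, login, pagination clicks
  needsSessionState: boolean;         // multi-step checkout or session-dependent state
  needsJsRendering: boolean;          // content only exists after client-side rendering
  hasPuppeteerLocators: boolean;      // existing page.$eval / page.$$eval code to reuse
  highVolumeStableSelectors: boolean; // large catalog with selectors that rarely change
}

type ScrapePlan =
  | {
      pattern: 'A';
      mode: 'auto' | 'js_rendering';
      outputFormat: 'html' | 'css_extractor';
      parseWith: 'cheerio' | 'setContent' | 'none';
    }
  | { pattern: 'B'; useSessionId: boolean };

function choosePattern(task: TaskProfile): ScrapePlan {
  // Anything that needs a live browser goes to Pattern B.
  if (task.needsInteraction || task.needsSessionState) {
    return { pattern: 'B', useSessionId: task.needsSessionState };
  }
  const mode = task.needsJsRendering ? 'js_rendering' : 'auto';
  // Server-side extraction means there is nothing left to parse locally.
  if (task.highVolumeStableSelectors) {
    return { pattern: 'A', mode, outputFormat: 'css_extractor', parseWith: 'none' };
  }
  return {
    pattern: 'A',
    mode,
    outputFormat: 'html',
    parseWith: task.hasPuppeteerLocators ? 'setContent' : 'cheerio',
  };
}
```

Encoding the choice in one place makes it easy to log which path a job took, which helps when comparing cost between the two patterns.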
8. puppeteer-cluster vs BaaS for Concurrency
puppeteer-cluster distributes work across a pool of local Chromium instances. It handles task queuing, worker lifecycle, and retry logic well — for unprotected targets on controlled infrastructure, it remains a solid choice. The operational cost is that you own everything below the task queue: Chromium revision pinning, memory limits per worker, crash recovery, proxy rotation, and block detection.
A BaaS remote browser endpoint inverts that model. You own the task logic — the Puppeteer script — and the provider owns the infrastructure. Browser crashes, memory pressure, IP blocks, and fingerprint maintenance are handled server-side. The tradeoff is per-session cost and network latency on each CDP command, which adds up on interaction-heavy scripts with many round trips.
Most teams running Pattern A at scale retire puppeteer-cluster for protected targets because the fetch is stateless and parallelises trivially with `Promise.allSettled` or a simple work queue. Pattern B sessions are better suited to interaction-heavy tasks where the per-session cost is justified by the complexity of the automation. Self-hosted cluster remains useful for internal tooling against unprotected targets where infrastructure cost is the primary constraint.
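A minimal sketch of the "simple work queue" idea for Pattern A: it runs a worker over a list with a concurrency cap and returns results in the same shape as `Promise.allSettled`. The helper name `mapSettledWithLimit` and the local `Settled` type are illustrative; the worker is injected, so any Pattern A fetch function can be passed in.

```typescript
// Same shape as the entries returned by Promise.allSettled.
type Settled<R> =
  | { status: 'fulfilled'; value: R }
  | { status: 'rejected'; reason: unknown };

// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order regardless of completion order.
async function mapSettledWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<Settled<R>[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results: Settled<R>[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index] as T) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage sketch, with fetchProtected being the Pattern A helper from section 3:
// const settled = await mapSettledWithLimit(urls, 5, fetchProtected);
```

A cap matters even for stateless requests, since an unbounded `Promise.allSettled` over thousands of URLs opens every connection at once.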
9. Error Handling and Session Debugging
Pattern A errors are standard HTTP failures plus OmniScrape API errors. Check `body.success` before accessing `body.data.content`. The API returns structured error objects with a message field — log the full response body on failure, not just the HTTP status code, since a 200 response can carry `success: false` for application-level errors like unsupported URL schemes or solver timeouts.
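The check described above can be sketched as a single unwrap step that treats a non-2xx status, a `success: false` body, and a missing `data.content` as failures, and keeps the full body for logging. The names `ScrapeEnvelope`, `ScrapeApiError`, and `unwrapContent` are hypothetical; only the `success`, `data.content`, and `error` fields come from the response shape shown earlier.

```typescript
// Minimal view of the response fields this sketch relies on.
interface ScrapeEnvelope {
  success: boolean;
  data?: { content?: string };
  error?: string;
}

// Hypothetical error type that carries the whole response for logging.
class ScrapeApiError extends Error {
  httpStatus: number;
  responseBody: unknown;

  constructor(message: string, httpStatus: number, responseBody: unknown) {
    super(message);
    this.name = 'ScrapeApiError';
    this.httpStatus = httpStatus;
    this.responseBody = responseBody;
  }
}

function unwrapContent(httpStatus: number, body: ScrapeEnvelope): string {
  if (httpStatus < 200 || httpStatus >= 300) {
    throw new ScrapeApiError(`HTTP ${httpStatus}`, httpStatus, body);
  }
  // A 200 response can still report an application-level failure.
  if (!body.success) {
    throw new ScrapeApiError(body.error ?? 'success: false without an error message', httpStatus, body);
  }
  const content = body.data?.content;
  if (typeof content !== 'string') {
    throw new ScrapeApiError('response has no data.content', httpStatus, body);
  }
  return content;
}
```

A caller would pass `response.status` and the parsed JSON, then log `JSON.stringify(err.responseBody)` in the catch block.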
Pattern B errors fall into two categories: CDP protocol errors and session lifecycle errors. `ProtocolError` typically means the remote browser session timed out or was terminated; this happens when a session exceeds the maximum duration or when the remote browser crashes. Wrap your `puppeteer.connect` block in a try/catch and log the session ID from `browser.wsEndpoint()` immediately after connecting, so it is already in your logs when an error occurs. The session ID fragment in the WebSocket URL is the key identifier for correlating failures with OmniScrape support logs. Log only that fragment rather than the full URL, which also carries your API key.
For both patterns, implement exponential backoff on retries. Transient failures, such as network timeouts or solver challenges that take longer than expected, are common in scraping workloads. Starting at a 2-second delay and doubling it on each attempt, with a small cap on the number of attempts, handles most transient cases without hammering the API on persistent failures.
import puppeteer, { type Browser, type ProtocolError } from 'puppeteer';

async function connectWithRetry(wsEndpoint: string, maxRetries = 3): Promise<Browser> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const browser = await puppeteer.connect({ browserWSEndpoint: wsEndpoint });
      // Log session ID for support correlation
      const url = new URL(browser.wsEndpoint());
      console.log('session:', url.searchParams.get('sessionId') ?? url.pathname);
      return browser;
    } catch (err) {
      const isProtocolError = (err as ProtocolError).name === 'ProtocolError';
      console.error(`Attempt ${attempt} failed:`, (err as Error).message);
      if (attempt === maxRetries || !isProtocolError) throw err;
      // Exponential backoff: 2s after attempt 1, 4s after attempt 2, and so on
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
    }
  }
  // Only reachable when maxRetries < 1
  throw new Error('connectWithRetry: maxRetries must be at least 1');
}
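The same backoff idea applies to Pattern A requests. Below is a generic sketch, not tied to any OmniScrape SDK: `retryWithBackoff`, `backoffDelayMs`, and the option names are illustrative, the default delays follow the 2-second doubling described above, and jitter is omitted for brevity.

```typescript
interface BackoffOptions {
  maxAttempts?: number;                    // total attempts, including the first
  baseDelayMs?: number;                    // delay before the second attempt
  maxDelayMs?: number;                     // cap on any single delay
  isRetryable?: (err: unknown) => boolean; // return false to fail fast
  sleep?: (ms: number) => Promise<void>;   // injectable so tests need not wait
}

// attempt is 1-based: 1 -> base, 2 -> 2x base, 3 -> 4x base, capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
}

async function retryWithBackoff<T>(
  task: () => Promise<T>,
  options: BackoffOptions = {},
): Promise<T> {
  const {
    maxAttempts = 3,
    baseDelayMs = 2000,
    maxDelayMs = 30000,
    isRetryable = () => true,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = options;

  let lastError: unknown = new Error('retryWithBackoff: maxAttempts must be at least 1');
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Stop on the final attempt or on errors the caller marks as permanent.
      if (attempt === maxAttempts || !isRetryable(err)) break;
      await sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs));
    }
  }
  throw lastError;
}

// Usage sketch, with fetchProtected being the Pattern A helper from section 3:
// const html = await retryWithBackoff(() => fetchProtected(url));
```

The `isRetryable` hook is where permanent failures, such as an unsupported URL scheme, can be excluded so they fail on the first attempt.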
10. Pre-Production Checklist
Run through this checklist before deploying a Puppeteer scraping pipeline to production. Most production incidents trace back to one of these items being skipped during development.
- Do not rely on puppeteer-extra-plugin-stealth as the sole anti-bot defense on hardened targets
- Confirm Pattern A reads `body.data.content`, not `body.data.html` (wrong field)
- Wrap Pattern B sessions in try/finally with `browser.disconnect()` to stop billing
- Use `waitForSelector` on a data element rather than `waitForNavigation({ waitUntil: 'networkidle2' })`
- Prefer `output_format: 'css_extractor'` with `css_selectors` when selectors are stable — reduces response size and parse overhead
- Pin the Puppeteer version in package.json and commit package-lock.json; each Puppeteer release is tied to a specific Chrome build, so an unpinned `npm update` can change browser behavior
- Implement exponential backoff on retries for both Pattern A HTTP errors and Pattern B ProtocolError
- Log `metadata.method_used` and `metadata.solver_used` from Pattern A responses for cost attribution
- Set `OMNISCRAPE_KEY` via environment variable — never hardcode API keys in source
- Test Pattern A in your serverless environment (Lambda, Cloud Functions) to confirm no local Chrome is launched
Frequently asked questions
Should I use Puppeteer or Playwright with OmniScrape BaaS?
Both connect to the same remote browser endpoint over CDP, so the functional result is identical. Use whichever library your codebase already depends on. If you are starting fresh, Playwright has built-in support for Firefox and WebKit alongside Chromium, which gives you more flexibility for fingerprint diversity. See the Playwright guide for equivalent Pattern A/B examples.
What is the difference between headless: true and headless: 'new' for bot detection?
Both modes are detected by Cloudflare Bot Management and DataDome on default Puppeteer configurations. The 'new' headless mode has a different user agent and fewer legacy detection vectors, but it introduces new ones. The option names have also changed over time: from Puppeteer 22, `headless: true` selects the new mode and the legacy implementation is `headless: 'shell'`. Neither mode is reliably undetected on hardened targets without infrastructure-level anti-detection. Use Pattern A or Pattern B instead of tweaking headless flags.
Can I use Pattern A in a serverless function like AWS Lambda?
Yes — Pattern A is a plain HTTP POST with no local browser process. It works in Lambda, Google Cloud Functions, Cloudflare Workers, and Vercel Edge Functions. Avoid calling `puppeteer.launch()` in serverless environments; the memory footprint of Chromium (150–300 MB) conflicts with typical function memory limits, and cold starts are slow. Pattern A sidesteps both problems.
How do I handle JavaScript-rendered pages in Pattern A without Pattern B?
Set `mode: 'js_rendering'` in your OmniScrape request. OmniScrape will execute the page in a headless browser server-side, wait for the DOM to stabilise, and return the rendered HTML. You can also pass `js_wait_selector` to wait for a specific element before capturing, or `js_wait_timeout` for a fixed delay. The response HTML in `body.data.content` reflects the post-JavaScript DOM state.
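A short sketch of building that request body. The field names `mode`, `output_format`, `js_wait_selector`, and `js_wait_timeout` are the ones named in this guide; the helper `buildJsRenderingRequest` is hypothetical, and because the unit of `js_wait_timeout` is not stated here, the sketch passes the number through unchanged.

```typescript
interface JsRenderingRequest {
  url: string;
  mode: 'js_rendering';
  output_format: 'html';
  js_wait_selector?: string;
  js_wait_timeout?: number;
}

function buildJsRenderingRequest(
  targetUrl: string,
  wait: { selector?: string; timeout?: number } = {},
): JsRenderingRequest {
  const parsed = new URL(targetUrl); // throws on malformed input
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported URL scheme: ${parsed.protocol}`);
  }
  const request: JsRenderingRequest = {
    url: parsed.toString(),
    mode: 'js_rendering',
    output_format: 'html',
  };
  // Only include wait options that were actually provided.
  if (wait.selector !== undefined) request.js_wait_selector = wait.selector;
  if (wait.timeout !== undefined) request.js_wait_timeout = wait.timeout;
  return request;
}

console.log(JSON.stringify(buildJsRenderingRequest('https://shop.example/products/1', { selector: '.price' })));
```

The result is what you would pass to `JSON.stringify` as the POST body in the Pattern A fetch, in place of the `mode: 'auto'` body.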
What TypeScript types should I define for the OmniScrape API response?
Define an interface with: `success: boolean`, `data: { content: string; css_extracted?: Record<string, string> }`, `metadata: { method_used: 'fast' | 'js_rendering'; solver_used: boolean; challenge_solved: boolean }`, `billing: { charged: number; balance_after: number }`, and `error?: string`. Cast the `response.json()` result to this interface and check `body.success` before accessing `body.data.content`.
How do I scrape paginated content efficiently with Pattern A?
For sites where pagination is URL-based (query params or path segments), generate the URL list and run requests concurrently with `Promise.allSettled`. For sites where the next-page URL is embedded in the HTML, extract it with Cheerio after each fetch and chain requests sequentially. Avoid Pattern B for pure pagination unless the page requires a click interaction to load the next set — stateless HTTP requests are significantly cheaper and faster than maintaining a browser session across pages.
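Both pagination styles can be sketched without a browser. The helpers below are hypothetical: `pageUrls` generates URL-based pages from a query parameter (assumed here to be called `page`), and `crawlByNextLink` chains requests sequentially using an injected fetch function and an injected next-link extractor, for example the Pattern A helper plus a Cheerio lookup.

```typescript
// URL-based pagination: one URL per page number, ready for concurrent fetching.
function pageUrls(baseUrl: string, firstPage: number, lastPage: number, param = 'page'): string[] {
  const urls: string[] = [];
  for (let n = firstPage; n <= lastPage; n++) {
    const u = new URL(baseUrl);
    u.searchParams.set(param, String(n));
    urls.push(u.toString());
  }
  return urls;
}

// Link-based pagination: follow the "next" link until it is missing,
// a page repeats, or maxPages is reached.
async function crawlByNextLink(
  startUrl: string,
  fetchHtml: (url: string) => Promise<string>,
  extractNextHref: (html: string) => string | null,
  maxPages = 50,
): Promise<{ url: string; html: string }[]> {
  const pages: { url: string; html: string }[] = [];
  const seen = new Set<string>();
  let current: string | null = startUrl;
  while (current !== null && pages.length < maxPages && !seen.has(current)) {
    seen.add(current);
    const html: string = await fetchHtml(current);
    pages.push({ url: current, html });
    const href: string | null = extractNextHref(html);
    // Resolve relative links against the page they were found on.
    current = href === null ? null : new URL(href, current).toString();
  }
  return pages;
}
```

The `seen` set and `maxPages` guard protect against sites whose last page links back to the first, which would otherwise loop indefinitely.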
Why does my Pattern B session disconnect unexpectedly?
The most common causes are: session timeout (the remote browser has a maximum session duration — check OmniScrape documentation for the current limit), the remote browser crashing due to a page error, or a network interruption between your client and the WebSocket endpoint. Log the session ID from `browser.wsEndpoint()` before the error and include it when contacting support — it allows the OmniScrape team to pull server-side logs for that specific session. Implement retry logic with exponential backoff for ProtocolError exceptions.