Cheerio Web Scraping: A Practical Guide

Cheerio brings jQuery selector syntax to Node.js servers — fast, familiar to frontend developers, tiny compared to JSDOM. It parses static HTML strings; it does not run client JavaScript. That is fine when OmniScrape already returned post-challenge, post-render HTML.

Pattern A: fetch from OmniScrape, cheerio.load(html). Pattern B: puppeteer.connect to BaaS, then cheerio.load on page.content(). See Puppeteer guide and OmniScrape vs Apify for Crawlee integration.

1.When to use Cheerio

Next.js API routes, Express microservices, Lambda functions parsing HTML without headless Chrome cold start. Frontend engineers writing scrape code without learning XPath.

jQuery-style .find() and .each()
Medium documents — faster than JSDOM
Worker threads for CPU-heavy parse batches
Apify/Crawlee handlers after OmniScrape fetch

2.Where Cheerio breaks

cheerio.load on empty SPA shell returns empty .price — fetch layer must render JS first via OmniScrape js_rendering or js_wait_selector.

innerText semantics differ from browser — test on real OmniScrape HTML samples. No form submit or click simulation.

No client JS execution
Malformed HTML parse differs from Chrome DOM
Silent selector misses until QA
Direct fetch to protected sites fails before Cheerio runs

3.Pattern A — fetch + Cheerio

Validate json.success and non-empty fields before returning — Cheerio fails silently on bad selectors.

pattern-a.ts

javascript

123456789101112131415161718192021222324252627282930313233343536import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

const API_KEY = process.env.OMNISCRAPE_KEY!;

async function scrapePdp(url: string) {
  const res = await fetch('https://api.omniscrape.io/v1/scrape', {
    method: 'POST',
    headers: {
      'X-API-Key': API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url,
      mode: 'auto',
      output_format: 'html',
      js_wait_selector: '.price',
      js_wait_timeout: 8000,
    }),
  });
  const json = await res.json();
  if (!json.success) throw new Error(JSON.stringify(json));

  const $ = cheerio.load(json.data.content);
  const price = $('.price').first().text().trim();
  const title = $('h1').first().text().trim();
  if (!price || !title) {
    throw new Error(`empty fields for ${url}`);
  }
  return {
    title,
    price,
    mode: json.metadata.method_used,
    cost: json.billing.charged,
  };
}

4.Validate with Zod

Runtime validation catches price format drift before bad rows hit your database.

validate.ts

typescript

123456789import { z } from 'zod';

const Product = z.object({
  title: z.string().min(1),
  price: z.string().regex(/\$[\d,.]+/),
});

const raw = { title, price };
const parsed = Product.parse(raw);

5.Skip Cheerio when possible

Flat PDP fields are cheaper to maintain as css_extractor JSON than Cheerio selectors.

css_extractor path

javascript

1234567891011const res = await fetch(API, {
  method: 'POST',
  headers: { 'X-API-Key': API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url,
    mode: 'auto',
    output_format: 'css_extractor',
    css_selectors: { title: 'h1', price: '.price' },
  }),
});
const { css_extracted } = (await res.json()).data;

6.Parsing repeating lists

Use .each on repeating cards — css_extractor does not map cleanly to unbounded lists.

list extract

javascript

12345678const items: { title: string; price: string }[] = [];
$('.product-card').each((_, el) => {
  const card = $(el);
  items.push({
    title: card.find('h2').text().trim(),
    price: card.find('.price').text().trim(),
  });
});

7.Pattern B — BaaS then Cheerio

After infinite scroll via BaaS, cheerio.load on page.content() — same selectors as Pattern A.

infinite scroll → cheerio

javascript

1234567891011121314151617181920import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';

const browser = await puppeteer.connect({
  browserWSEndpoint:
    `wss://browser.omniscrape.io?apikey=${process.env.OMNISCRAPE_KEY}&render_media=false`,
});
const page = await browser.newPage();
await page.goto('https://protected.example/search');
for (let i = 0; i < 5; i++) {
  await page.click('#load-more');
  await page.waitForSelector('.product-card');
}
const html = await page.content();
await browser.disconnect();

const $ = cheerio.load(html);
const titles = $('.product-card h2')
  .map((_, el) => $(el).text().trim())
  .get();

8.Worker threads for CPU parse

Large HTML batches: fetch async in main thread, cheerio.load in worker_threads pool to avoid blocking event loop.

9.Next.js API route note

Run scrape server-side only — never expose OMNISCRAPE_KEY to client components. API route calls OmniScrape, returns JSON to frontend.

10.Checklist

Cheerio is not a crawler — bring URL queue separately.

Verify json.success before cheerio.load
Trim text and validate non-empty
Log metadata.method_used for cost
Archive HTML on selector failures
Prefer css_extractor for flat PDP fields

Frequently asked questions

Cheerio vs JSDOM?

Cheerio faster and smaller; JSDOM closer to browser DOM — usually unnecessary after OmniScrape render.

Cheerio in Apify Actor?

Yes — OmniScrape fetch inside requestHandler, cheerio.load — see Apify comparison guide.

TypeScript types for cheerio?

Use cheerio types package; type your extracted row interfaces explicitly.

Why .text() vs .attr()?

Prices in data-price attributes need .attr('data-price') — inspect OmniScrape HTML sample in DevTools.

ESM import cheerio?

import * as cheerio from 'cheerio' on cheerio 1.x — match your package.json module setting.

Related guides

Ready to scrape without blocks?

Get your API key in minutes. Test protected URLs from the dashboard — no credit card required to start.

1.When to use Cheerio

Next.js API routes, Express microservices, Lambda functions parsing HTML without headless Chrome cold start. Frontend engineers writing scrape code without learning XPath.

jQuery-style .find() and .each()
Medium documents — faster than JSDOM
Worker threads for CPU-heavy parse batches
Apify/Crawlee handlers after OmniScrape fetch

2.Where Cheerio breaks

cheerio.load on empty SPA shell returns empty .price — fetch layer must render JS first via OmniScrape js_rendering or js_wait_selector.

innerText semantics differ from browser — test on real OmniScrape HTML samples. No form submit or click simulation.

No client JS execution
Malformed HTML parse differs from Chrome DOM
Silent selector misses until QA
Direct fetch to protected sites fails before Cheerio runs

3.Pattern A — fetch + Cheerio

Validate json.success and non-empty fields before returning — Cheerio fails silently on bad selectors.

pattern-a.ts

javascript

123456789101112131415161718192021222324252627282930313233343536import fetch from 'node-fetch';
import * as cheerio from 'cheerio';

const API_KEY = process.env.OMNISCRAPE_KEY!;

async function scrapePdp(url: string) {
  const res = await fetch('https://api.omniscrape.io/v1/scrape', {
    method: 'POST',
    headers: {
      'X-API-Key': API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      url,
      mode: 'auto',
      output_format: 'html',
      js_wait_selector: '.price',
      js_wait_timeout: 8000,
    }),
  });
  const json = await res.json();
  if (!json.success) throw new Error(JSON.stringify(json));

  const $ = cheerio.load(json.data.content);
  const price = $('.price').first().text().trim();
  const title = $('h1').first().text().trim();
  if (!price || !title) {
    throw new Error(`empty fields for ${url}`);
  }
  return {
    title,
    price,
    mode: json.metadata.method_used,
    cost: json.billing.charged,
  };
}

4.Validate with Zod

Runtime validation catches price format drift before bad rows hit your database.

validate.ts

typescript

123456789import { z } from 'zod';

const Product = z.object({
  title: z.string().min(1),
  price: z.string().regex(/\$[\d,.]+/),
});

const raw = { title, price };
const parsed = Product.parse(raw);

5.Skip Cheerio when possible

Flat PDP fields are cheaper to maintain as css_extractor JSON than Cheerio selectors.

css_extractor path

javascript

1234567891011const res = await fetch(API, {
  method: 'POST',
  headers: { 'X-API-Key': API_KEY, 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url,
    mode: 'auto',
    output_format: 'css_extractor',
    css_selectors: { title: 'h1', price: '.price' },
  }),
});
const { css_extracted } = (await res.json()).data;

6.Parsing repeating lists

Use .each on repeating cards — css_extractor does not map cleanly to unbounded lists.

list extract

javascript

12345678const items: { title: string; price: string }[] = [];
$('.product-card').each((_, el) => {
  const card = $(el);
  items.push({
    title: card.find('h2').text().trim(),
    price: card.find('.price').text().trim(),
  });
});

7.Pattern B — BaaS then Cheerio

After infinite scroll via BaaS, cheerio.load on page.content() — same selectors as Pattern A.

infinite scroll → cheerio

javascript

1234567891011121314151617181920import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';

const browser = await puppeteer.connect({
  browserWSEndpoint:
    `wss://browser.omniscrape.io?apikey=${process.env.OMNISCRAPE_KEY}&render_media=false`,
});
const page = await browser.newPage();
await page.goto('https://protected.example/search');
for (let i = 0; i < 5; i++) {
  await page.click('#load-more');
  await page.waitForSelector('.product-card');
}
const html = await page.content();
await browser.disconnect();

const $ = cheerio.load(html);
const titles = $('.product-card h2')
  .map((_, el) => $(el).text().trim())
  .get();

8.Worker threads for CPU parse

Large HTML batches: fetch async in main thread, cheerio.load in worker_threads pool to avoid blocking event loop.

9.Next.js API route note

Run scrape server-side only — never expose OMNISCRAPE_KEY to client components. API route calls OmniScrape, returns JSON to frontend.

10.Checklist

Cheerio is not a crawler — bring URL queue separately.

Verify json.success before cheerio.load
Trim text and validate non-empty
Log metadata.method_used for cost
Archive HTML on selector failures
Prefer css_extractor for flat PDP fields

Frequently asked questions

Cheerio vs JSDOM?

Cheerio faster and smaller; JSDOM closer to browser DOM — usually unnecessary after OmniScrape render.

Cheerio in Apify Actor?

Yes — OmniScrape fetch inside requestHandler, cheerio.load — see Apify comparison guide.

TypeScript types for cheerio?

Use cheerio types package; type your extracted row interfaces explicitly.

Why .text() vs .attr()?

Prices in data-price attributes need .attr('data-price') — inspect OmniScrape HTML sample in DevTools.

ESM import cheerio?

import * as cheerio from 'cheerio' on cheerio 1.x — match your package.json module setting.

Related guides

Ready to scrape without blocks?

Get your API key in minutes. Test protected URLs from the dashboard — no credit card required to start.