OmniScrape

Web Scraping with Perl: Mojo::UserAgent, Mojo::DOM, and OmniScrape

Perl built much of the early web automation culture. Somewhere on an aging VPS, a cron job still runs LWP::Simple and regex against HTML that no longer exists. Mojo::UserAgent is the upgrade path: non-blocking I/O, built-in JSON helpers, and Mojo::DOM for CSS-selector-based parsing without pretending regex is a parser. The Mojolicious ecosystem is actively maintained and handles redirects, cookies, and keep-alive connections out of the box, with TLS available once IO::Socket::SSL is installed.

Cloudflare does not care that your script shipped in 2008. When Mojo gets challenge HTML instead of product data, the right move is to POST to the OmniScrape API and feed the returned data.content to Mojo::DOM. OmniScrape's Web Unlocker handles TLS fingerprinting, CAPTCHA solving, and browser emulation externally, so your Perl script stays simple. For a comparison with a greener stack, see web scraping with Python.

On this page

1. Installing Mojolicious and Mojo::UserAgent
2. Fetching Pages with Mojo::UserAgent
3. Parsing HTML with Mojo::DOM
4. Retiring LWP and Regex-Based Parsers
5. Calling OmniScrape from Mojo::UserAgent
6. Non-Blocking Concurrent Fetches with Mojo::IOLoop
7. Structured Extraction Without DOM Parsing
8. JavaScript-Rendered SPAs Require js_rendering Mode
9. Unicode Handling, Cron Hygiene, and Secret Management
10. Robust Error Handling for Long-Running Cron Jobs
11. FAQ

1. Installing Mojolicious and Mojo::UserAgent

Mojolicious ships Mojo::UserAgent as part of its core distribution, so the user agent itself needs no separate install. HTTPS support is the one extra: it relies on the optional IO::Socket::SSL module, so install that alongside Mojolicious. You do not need a full web application; the user agent works perfectly from one-liner scripts, cron jobs, and standalone CLI tools. cpanm is the fastest installer; fall back to the system cpan if cpanm is unavailable.

On Debian/Ubuntu systems you can also install via apt (libmojolicious-perl), but the CPAN version is typically several releases ahead. Pinning a specific version in a cpanfile ensures reproducible deployments across servers.

terminal
bash
# Install cpanm first if needed
curl -L https://cpanmin.us | perl - --sudo App::cpanminus

# Then install Mojolicious, plus IO::Socket::SSL for HTTPS support
cpanm Mojolicious IO::Socket::SSL

# Verify
perl -MMojolicious -e 'print Mojolicious->VERSION, "\n"'
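
The version pinning mentioned above could look like the following cpanfile. This is only a sketch: the version numbers are placeholders rather than recommendations, so substitute the releases you have actually tested against.

```perl
# cpanfile (sketch; version numbers are placeholders, not recommendations)
requires 'perl', '5.016';

# Pin the exact Mojolicious release you have tested against
requires 'Mojolicious', '== 9.35';

# Needed for https:// URLs; Mojolicious treats it as an optional dependency
requires 'IO::Socket::SSL', '>= 2.009';
```

Running cpanm --installdeps . in the directory that holds the cpanfile installs the listed dependencies on each server.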

2. Fetching Pages with Mojo::UserAgent

$ua->get returns a Mojo::Transaction::HTTP object. Calling ->result on it gives you the Mojo::Message::Response, or dies if the connection itself failed (DNS error, refused connection, timeout). The body method returns the raw response bytes; text returns them decoded according to the Content-Type charset, which is the form Mojo::DOM expects. Always check is_success before proceeding: a 403 or 503 is a valid response and will not throw.

Set max_redirects to handle login redirects and CDN hops; it defaults to 0, so redirects are not followed unless you ask. request_timeout is the wall-clock limit in seconds for establishing the connection, sending the request, and receiving the whole response, which makes it critical for slow targets that stall rather than reset. The connect_timeout option caps the connection phase separately.

fetch.pl
perl
use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new(
  max_redirects   => 5,
  request_timeout => 30,
  connect_timeout => 10,
);

# Mimic a real browser User-Agent to avoid trivial blocks
$ua->transactor->name(
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
  . '(KHTML, like Gecko) Chrome/124.0 Safari/537.36'
);

my $url = 'https://books.toscrape.com/catalogue/page-1.html';
my $tx  = $ua->get($url);
my $res = $tx->result;   # dies on connection errors, not on HTTP errors

die "HTTP " . $res->code . " fetching $url\n" unless $res->is_success;

# text() decodes the body using the response charset; body() is raw bytes
my $html = $res->text;
printf "Fetched %d bytes from %s\n", length($res->body), $url;

3. Parsing HTML with Mojo::DOM

Mojo::DOM is a CSS-selector-based parser modelled loosely on jQuery's API. at() returns the first matching element or undef. Calling a method on undef dies, so test the result (a ternary works well) before chaining ->text or ->attr on a node that may be missing. find() returns a Mojo::Collection you can chain with each, map, and grep.

The text method returns the text directly inside the element, without descending into child elements; all_text includes text from descendant nodes. Recent Mojolicious releases return whitespace as-is from both, so trim or collapse it yourself when the markup is indented. attr() fetches element attributes. For attribute values that may be absent, use attr('href') // '' rather than letting undef propagate into your data structure.

parse.pl
perl
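use strict;
use warnings;
use JSON::PP;
use Mojo::DOM;
use Mojo::UserAgent;

# Fetch as in the previous step; text() returns decoded characters
my $url = 'https://books.toscrape.com/catalogue/page-1.html';
my $res = Mojo::UserAgent->new(max_redirects => 5)->get($url)->result;
die "HTTP " . $res->code . " fetching $url\n" unless $res->is_success;
my $html = $res->text;

my $dom = Mojo::DOM->new($html);
my @books;

for my $card ($dom->find('article.product_pod')->each) {
  my $title_tag = $card->at('h3 a');
  my $price_tag = $card->at('.price_color');
  my $stock_tag = $card->at('.instock');

  push @books, {
    title    => $title_tag ? $title_tag->attr('title') : undef,
    price    => $price_tag ? $price_tag->text          : undef,
    # Emit a real JSON boolean instead of 1 or an empty string
    in_stock => $stock_tag && index($stock_tag->text, 'In stock') >= 0
                  ? JSON::PP::true()
                  : JSON::PP::false(),
  };
}

printf "Found %d books\n", scalar @books;

# utf8 makes encode() emit UTF-8 bytes, so the pound sign prints cleanly
print JSON::PP->new->utf8->pretty->canonical->encode($books[0]) if @books;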
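
The guarding and whitespace advice above can be folded into one small helper. The sketch below is illustrative only: text_or is a hypothetical name, not part of Mojo::DOM, and the HTML fragment is made up so the example runs without network access.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: whitespace-normalized text of the first match,
# or a caller-supplied default when the selector matches nothing.
sub text_or {
  my ($dom, $selector, $default) = @_;
  my $node = $dom->at($selector) or return $default;
  my $text = $node->all_text;
  $text =~ s/\s+/ /g;     # collapse runs of whitespace
  $text =~ s/^ | $//g;    # trim the single leading/trailing space left over
  return $text;
}

my $dom = Mojo::DOM->new(<<'HTML');
<article class="product_pod">
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="instock">
    <i class="icon-ok"></i>
    In stock
  </p>
</article>
HTML

my $stock = text_or($dom, '.instock',     'unknown');
my $price = text_or($dom, '.price_color', 'n/a');      # selector is absent
my $href  = $dom->at('h3 a')->attr('href') // '';      # attribute is absent

print "stock=$stock price=$price href='$href'\n";
```

Keeping the default at the call site makes it obvious in the output which fields were genuinely missing from the page.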

4. Retiring LWP and Regex-Based Parsers

LWP::UserAgent needs LWP::Protocol::https and an up-to-date IO::Socket::SSL stack before it can talk to sites that require SNI or modern cipher suites, and on long-neglected installs it often fails at the TLS handshake. HTML::TreeBuilder and HTML::Parser work, but they require more boilerplate than Mojo::DOM for the same result. Regex-based HTML parsing is fragile by definition: attribute order, whitespace, and encoding variations all cause silent failures in production.

The migration path is straightforward: swap LWP::UserAgent for Mojo::UserAgent (the call shapes are close enough for simple GET/POST scripts, although the response now arrives via $tx->result rather than as an HTTP::Response), replace HTML::TreeBuilder traversal with Mojo::DOM CSS selectors, and route bot-protected targets through OmniScrape. This extends the useful life of Perl glue scripts without a full rewrite. For background on what the API handles on your behalf, see the Cloudflare bypass guide.

If you cannot touch a legacy script this quarter, wrap it: let OmniScrape fetch the HTML and write it to a temp file, then feed that file to the existing LWP/regex pipeline unchanged. It buys time without a rewrite.
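
A minimal sketch of that wrapper's file-handling half, using only core modules. write_snapshot and legacy_parser.pl are hypothetical names, and the HTML string stands in for the data.content value the API would return.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write already-decoded HTML to a temp file and return its path.
# UNLINK => 0 keeps the file so the legacy script can read it;
# the caller is responsible for deleting it afterwards.
sub write_snapshot {
  my ($html) = @_;
  my ($fh, $path) = tempfile(
    'scrape-XXXXXX',
    SUFFIX => '.html',
    TMPDIR => 1,
    UNLINK => 0,
  );
  binmode $fh, ':encoding(UTF-8)';
  print {$fh} $html;
  close $fh or die "Cannot close $path: $!\n";
  return $path;
}

# In the real wrapper, $html would be $body->{data}{content}.
my $html = "<html><body><p class=\"price\">\x{a3}51.77</p></body></html>";
my $path = write_snapshot($html);
print "Snapshot written to $path\n";

# Hand the file to the untouched legacy script (hypothetical name):
# system($^X, 'legacy_parser.pl', $path) == 0
#   or die "legacy parser exited with status $?\n";

unlink $path or warn "Could not remove $path: $!\n";
```

Passing a file path rather than piping keeps the legacy script's interface unchanged if it already accepts a filename argument.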

5. Calling OmniScrape from Mojo::UserAgent

Mojo::UserAgent's json => shortcut serializes a Perl hashref to JSON and sets Content-Type: application/json automatically. The response is decoded back to a Perl data structure via ->result->json. Set request_timeout to at least 120 seconds — js_rendering jobs can take 30–60 seconds on complex pages.

The response HTML lives at body->{data}{content}, not data.html. Check body->{success} before accessing it. The metadata.method_used field tells you whether OmniScrape used its fast HTTP path or a headless browser, which is useful for cost tracking and debugging.

omniscrape.pl
perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::DOM;

my $ua = Mojo::UserAgent->new(request_timeout => 120);

# The json generator sets Content-Type: application/json for us
my $tx = $ua->post(
  'https://api.omniscrape.io/v1/scrape' => {
    'X-API-Key' => $ENV{OMNISCRAPE_KEY},
  } => json => {
    url           => 'https://protected-shop.com/sku/771',
    mode          => 'auto',
    output_format => 'html',
    enable_solver => \1,   # JSON true
    proxy         => 'residential:us',
  }
);

my $res = $tx->result;
die "API HTTP " . $res->code . "\n" unless $res->is_success;

my $body = $res->json;
die "Scrape failed: " . ($body->{error} // 'unknown') . "\n"
  unless $body->{success};

my $html    = $body->{data}{content};
my $method  = $body->{metadata}{method_used};
my $charged = $body->{billing}{charged};

my $dom   = Mojo::DOM->new($html);
my $node  = $dom->at('.product-price');
my $price = $node ? $node->text : 'NOT FOUND';

# Escape the dollar sign: "$%" is a Perl special variable inside double quotes
printf "Price: %s (via %s, cost \$%.4f)\n", $price, $method, $charged;

6. Non-Blocking Concurrent Fetches with Mojo::IOLoop

Mojo::UserAgent is fully non-blocking when used inside a Mojo::IOLoop. Passing a callback as the last argument to post() or get() switches the request to async mode. The event loop drives all requests concurrently in a single thread — no fork, no threads, no shared-memory complexity.

Cap parallelism deliberately. Ten simultaneous js_rendering jobs multiply cost linearly and can exhaust your API concurrency quota. Five concurrent requests is a reasonable default for most batch jobs; tune based on your OmniScrape plan limits. Track $active carefully — decrement it in every callback branch, including error paths, or the loop stalls.

async.pl
perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop;

my @urls = qw(
  https://example.com/product/1
  https://example.com/product/2
  https://example.com/product/3
  https://example.com/product/4
  https://example.com/product/5
  https://example.com/product/6
);

my $ua     = Mojo::UserAgent->new(request_timeout => 120);
my $total  = @urls;   # remember the batch size before the queue is drained
my $active = 0;
my $max    = 5;
my @results;

sub scrape_next {
  # Stop the loop once the queue is empty and nothing is in flight
  unless (@urls || $active) {
    Mojo::IOLoop->stop;
    return;
  }
  return if $active >= $max || !@urls;

  my $url = shift @urls;
  $active++;

  $ua->post(
    'https://api.omniscrape.io/v1/scrape' => {
      'X-API-Key' => $ENV{OMNISCRAPE_KEY},
    } => json => {
      url           => $url,
      mode          => 'auto',
      output_format => 'html',
      enable_solver => \1,
    } => sub {
      my ($ua, $tx) = @_;
      $active--;

      # result() dies on connection errors, so wrap it in eval
      my $body = eval { $tx->result->json } // {};
      if ($body->{success}) {
        my $html = $body->{data}{content};
        # Parse with Mojo::DOM here
        push @results, { url => $url, length => length($html) };
        printf "OK  %s (%d bytes)\n", $url, length($html);
      } else {
        warn "FAIL $url\n";
      }

      scrape_next();
    }
  );

  # Kick off another request immediately if slots remain
  scrape_next() if $active < $max && @urls;
}

Mojo::IOLoop->next_tick(\&scrape_next);
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

printf "Completed %d/%d URLs\n", scalar @results, $total;
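
As an alternative to tracking $active by hand, Mojo::Promise->map accepts a concurrency option and does the slot bookkeeping itself. The sketch below swaps the real API call for a timer so it runs offline; fake_scrape_p is a hypothetical stand-in for a call such as $ua->post_p(...), and the peak counter exists only to show that the cap holds.

```perl
use strict;
use warnings;
use Mojo::Promise;

my ($active, $peak) = (0, 0);

# Hypothetical stand-in for $ua->post_p(...): resolves after a short timer.
sub fake_scrape_p {
  my ($url) = @_;
  $active++;
  $peak = $active if $active > $peak;
  return Mojo::Promise->timer(0.01)->then(sub {
    $active--;
    return { url => $url, success => 1 };
  });
}

my @urls = map {"https://example.com/product/$_"} 1 .. 12;
my @done;

# At most five promises are pending at any time; the rest wait in the queue.
Mojo::Promise->map({concurrency => 5}, sub { fake_scrape_p($_) }, @urls)
  ->then(sub { @done = map { $_->[0] } @_ })   # each result is an array ref
  ->catch(sub { warn "Batch failed: @_\n" })
  ->wait;

printf "%d results, peak concurrency %d\n", scalar @done, $peak;
```

Note that a single rejected promise rejects the whole batch, so a production version would likely catch errors inside the per-URL callback and resolve with a failure record instead.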

7. Structured Extraction Without DOM Parsing

When a cron script only needs three fields, css_extractor mode lets OmniScrape apply CSS selectors server-side and return a structured JSON object. You skip Mojo::DOM entirely and read directly from body->{data}{css_extracted}. This reduces response payload size and simplifies the Perl code.

css_selectors is a flat key-value map: keys are arbitrary field names you choose, values are CSS selector strings. The API returns the text content of the first matching element for each selector. If a selector matches nothing, the key is absent from css_extracted, so supply a default with // when reading, for example $fields->{price} // '(missing)'.

css_extractor.pl
perl
use strict;
use warnings;
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new(request_timeout => 120);

my $tx = $ua->post(
  'https://api.omniscrape.io/v1/scrape' => {
    'X-API-Key' => $ENV{OMNISCRAPE_KEY},
  } => json => {
    url           => 'https://example.com/product/771',
    mode          => 'auto',
    output_format => 'css_extractor',
    enable_solver => \1,
    css_selectors => {
      title       => 'h1',
      price       => '.price-now',
      sku         => '[data-sku]',
      rating      => '.star-rating',
      description => 'meta[name="description"]',
    },
  }
);

my $body = $tx->result->json;
die "Failed\n" unless $body->{success};

my $fields = $body->{data}{css_extracted};
printf "Title: %s\n",  $fields->{title}       // '(missing)';
printf "Price: %s\n",  $fields->{price}       // '(missing)';
printf "SKU:   %s\n",  $fields->{sku}         // '(missing)';

8. JavaScript-Rendered SPAs Require js_rendering Mode

Mojo::UserAgent fetches the raw HTTP response — it does not execute JavaScript. React, Vue, and Angular applications that populate the DOM after page load return an empty shell to a plain HTTP client. For these targets, use mode: js_rendering in the OmniScrape request. The API runs a headless Chromium instance, executes the page JavaScript, and returns the fully rendered HTML.

Pair js_rendering with js_wait_selector to tell the browser to wait until a specific element appears before capturing the DOM. Without it, the snapshot may be taken before async data has loaded. See the scraping JavaScript-rendered pages guide for selector strategies and timeout tuning.

js_rendering.pl
perl
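use strict;
use warnings;
use Mojo::UserAgent;

# Rendering jobs are slow, so allow a generous timeout
my $ua = Mojo::UserAgent->new(request_timeout => 120);

my $tx = $ua->post(
  'https://api.omniscrape.io/v1/scrape' => {
    'X-API-Key' => $ENV{OMNISCRAPE_KEY},
  } => json => {
    url              => 'https://spa-example.com/products',
    mode             => 'js_rendering',
    output_format    => 'html',
    js_wait_selector => '.product-grid',   # wait for this to appear
    js_wait_timeout  => 15,               # seconds
    enable_solver    => \1,
  }
);

my $body = $tx->result->json;
die "Render failed\n" unless $body && $body->{success};
printf "Rendered %d characters\n", length $body->{data}{content};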

9. Unicode Handling, Cron Hygiene, and Secret Management

Perl scraping scripts run unattended for years. Small oversights compound into 3 a.m. pages. Apply these rules at the start of every script and review them when onboarding a new target site.

  • Add 'use utf8;' and 'use open qw(:std :utf8);' at the top of every script — this ensures source code and I/O use UTF-8 consistently without per-filehandle binmode calls
  • Store OMNISCRAPE_KEY in the environment (via crontab's OMNISCRAPE_KEY=... line or a secrets manager) — never hardcode it in the script or commit it to version control
  • Log billing.charged and billing.balance_after per run to a structured log file — this gives finance a per-job cost breakdown and alerts you when a pipeline suddenly gets expensive
  • Chunk large URL lists and write progress to a state file between cron ticks — if the job dies at URL 4,000 of 10,000, resume from the checkpoint rather than restarting from zero
  • Rotate User-Agent strings and add realistic request headers (Accept, Accept-Language, Referer) when calling targets directly — OmniScrape handles this automatically when you route through the API
  • Set a hard wall-clock timeout in the cron entry itself (e.g., 'timeout 300 perl scrape.pl') so a stalled script does not block the next scheduled run
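
The checkpointing rule in the list above might be sketched as follows, using only core modules. The function names, the state-file layout (a single next_index field), and the chunk size are all assumptions chosen for illustration.

```perl
use strict;
use warnings;
use JSON::PP;
use File::Temp qw(tempdir);

# Read the checkpoint, or start from zero if no state file exists yet.
sub load_state {
  my ($path) = @_;
  return { next_index => 0 } unless -e $path;
  open my $fh, '<', $path or die "Cannot read $path: $!\n";
  my $json = do { local $/; <$fh> };
  close $fh;
  return JSON::PP->new->decode($json);
}

# Write to a temp file and rename, so a crash never leaves a half-written file.
sub save_state {
  my ($path, $state) = @_;
  my $tmp = "$path.tmp";
  open my $fh, '>', $tmp or die "Cannot write $tmp: $!\n";
  print {$fh} JSON::PP->new->canonical->encode($state);
  close $fh or die "Cannot close $tmp: $!\n";
  rename $tmp, $path or die "Cannot rename $tmp to $path: $!\n";
}

# Process at most $chunk_size URLs per run, saving progress after each one.
sub process_chunk {
  my ($urls, $state_path, $chunk_size, $handler) = @_;
  my $state = load_state($state_path);
  my $start = $state->{next_index};
  my $end   = $start + $chunk_size - 1;
  $end = $#$urls if $end > $#$urls;

  for my $i ($start .. $end) {
    $handler->($urls->[$i]);          # if this dies, the checkpoint stays at $i
    $state->{next_index} = $i + 1;
    save_state($state_path, $state);
  }
  return $state->{next_index};
}

# Demo: two "cron ticks" over a five-item list, three URLs per tick.
my $demo_dir = tempdir(CLEANUP => 1);
my @demo     = map {"https://example.com/item/$_"} 1 .. 5;
for my $tick (1, 2) {
  my $next = process_chunk(\@demo, "$demo_dir/state.json", 3, sub { print "scrape $_[0]\n" });
  print "tick $tick finished, next index $next\n";
}
```

Saving after every URL costs one small write per request, which is negligible next to the request itself and means a killed job repeats at most one URL.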

10. Robust Error Handling for Long-Running Cron Jobs

Exit non-zero on hard failures so cron's MAILTO mechanism emails you. Distinguish between errors that require immediate human attention and those that are safe to retry automatically. Log the full response body on unexpected status codes — a truncated error message is rarely enough to diagnose the root cause.

  • 401 Unauthorized — die immediately with a clear message; the API key is wrong or revoked and retrying will not help
  • 402 Payment Required — die and disable the crontab entry; top up the account balance before resuming
  • 429 Too Many Requests — sleep for 30–60 seconds and retry once; if the second attempt also 429s, die and let cron reschedule
  • 502/503/504 Gateway errors — retry up to 3 times with exponential backoff (5s, 15s, 45s) and a small random jitter to avoid thundering-herd on shared infrastructure
  • success: false with a non-HTTP error — warn and continue the batch; log the url and error field so you can investigate the specific target offline
  • Mojo::UserAgent connection timeout — treat as a transient network failure; retry once, then skip and log; do not die unless the entire batch is failing
  • Malformed JSON response (eval {} around ->json) — log the raw body and skip; this indicates an upstream proxy issue, not a bug in your selectors
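
One way to encode the status-code rules above is a small dispatch function plus a backoff calculator. This is a sketch only: the action labels ('fatal', 'retry_once', and so on) and both function names are hypothetical, and the jitter range is an arbitrary choice.

```perl
use strict;
use warnings;

# Map an HTTP status code to an action label (labels are illustrative).
sub classify_status {
  my ($code) = @_;
  return 'ok'         if $code >= 200 && $code < 300;
  return 'fatal'      if $code == 401 || $code == 402;   # bad key or no balance
  return 'retry_once' if $code == 429;                   # rate limited
  return 'backoff'    if $code == 502 || $code == 503 || $code == 504;
  return 'log_and_skip';
}

# Delay before retry number $attempt (1, 2, 3): 5s, 15s, 45s plus jitter.
# Pass a jitter of 0 to get the deterministic base delay.
sub backoff_delay {
  my ($attempt, $max_jitter) = @_;
  my $base   = 5 * 3 ** ($attempt - 1);
  my $jitter = $max_jitter ? rand($max_jitter) : 0;
  return $base + $jitter;
}

for my $code (200, 401, 429, 503, 418) {
  printf "%d => %s\n", $code, classify_status($code);
}
printf "attempt %d: wait about %.1fs\n", $_, backoff_delay($_, 2) for 1 .. 3;
```

In a cron script the 'fatal' branch would die with a non-zero exit status so MAILTO fires, while 'backoff' would sleep for backoff_delay($attempt, ...) seconds before retrying.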

Frequently asked questions

Should I use Mojo::UserAgent or LWP::UserAgent for new Perl scraping projects?

Mojo::UserAgent for anything new. It handles modern TLS (through IO::Socket::SSL), HTTP/1.1 keep-alive, JSON encoding/decoding, cookies, and non-blocking I/O from a single distribution. LWP::UserAgent is appropriate only when you are patching a script you cannot otherwise touch, and even then, wrapping it with OmniScrape for protected targets is easier than fixing LWP's TLS configuration.

Mojo::DOM or HTML::TreeBuilder — which should I use?

Mojo::DOM for new code. CSS selectors are shorter and more readable than TreeBuilder's traversal API, and Mojo::DOM is maintained as part of Mojolicious. Use HTML::TreeBuilder only if you have a large existing codebase that depends on it heavily and the migration cost is not justified.

Do I need a full Mojolicious web application to use Mojo::UserAgent?

No. Mojo::UserAgent ships inside the Mojolicious distribution, but it works without an application object. Import it with 'use Mojo::UserAgent;' in any Perl script, whether that is a cron job, CLI tool, or daemon. A full Mojolicious app makes sense only if you need an admin interface, a webhook receiver, or a REST API on top of your scraping pipeline.

Is Perl still a viable choice for web scraping in production?

For maintaining existing pipelines, yes — particularly when OmniScrape handles modern bot protection externally, so the Perl script only needs to parse clean HTML. For greenfield projects where the team has a choice, most organizations pick Python (requests + BeautifulSoup/Playwright) or Go for easier hiring and broader library support. Perl is not a wrong choice; it is a narrower one.

Why not use WWW::Mechanize for scraping?

WWW::Mechanize simulates a browser from the pre-SPA era — it handles forms and links but has no JavaScript execution, struggles with modern TLS configurations, and does not support async I/O. For static sites it still works, but Mojo::UserAgent + OmniScrape covers the same ground and extends naturally to JavaScript-heavy and bot-protected targets.

How do I handle sites that require cookies or session state across requests?

Mojo::UserAgent maintains a cookie jar automatically across requests made by the same instance. For OmniScrape-routed requests, pass a session_id field in the request body — the API will reuse the same browser session and cookie state across calls that share the same session_id string.

What is the difference between mode auto and mode js_rendering?

mode auto tries a fast HTTP request first and escalates to a headless browser automatically if the response looks like a bot challenge or an empty SPA shell. mode js_rendering always uses a headless browser. Use auto as the default — it is cheaper when the fast path succeeds. Switch to js_rendering explicitly when you know the target always requires JavaScript execution and you want to avoid the auto-detection overhead.

Related guides

  • Web Scraping with Python
  • How to Bypass Cloudflare When Web Scraping
  • Scrape JavaScript-Rendered Pages: SPAs, Hydration, and Hidden APIs
  • Web Scraping API: Endpoint, Modes, Output Formats & Integration Patterns

© 2026 OmniScrape. All rights reserved.
