1. Installing Mojolicious and Mojo::UserAgent
Mojolicious ships Mojo::UserAgent as part of its core distribution, so no separate install is required. You do not need a full web application; the user agent works from one-liner scripts, cron jobs, and standalone CLI tools. HTTPS support depends on IO::Socket::SSL, which Mojolicious treats as an optional dependency, so install it alongside if it is not already present. cpanm is the fastest installer; fall back to the system cpan if cpanm is unavailable.
On Debian/Ubuntu systems you can also install via apt (libmojolicious-perl), but the CPAN version is typically several releases ahead. Pinning a specific version in a cpanfile ensures reproducible deployments across servers.
# Install cpanm first if needed
curl -L https://cpanmin.us | perl - --sudo App::cpanminus
# Then install Mojolicious, plus IO::Socket::SSL for HTTPS support
cpanm Mojolicious IO::Socket::SSL
# Verify
perl -MMojolicious -e 'print Mojolicious->VERSION, "\n"'
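The cpanfile pinning mentioned above might look like the following sketch. The Mojolicious version is a placeholder chosen for illustration, not a recommendation; substitute whichever release you have tested against.

```perl
# cpanfile (the pinned version below is an illustrative placeholder)
requires 'Mojolicious',     '== 9.35';
requires 'IO::Socket::SSL', '>= 2.009';
```

Running cpanm --installdeps . in the directory that holds this file installs the listed dependencies.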
2. Fetching Pages with Mojo::UserAgent
$ua->get returns a Mojo::Transaction::HTTP object. Calling ->result on it gives you the Mojo::Message::Response, or dies if the connection itself failed. The body method returns the raw response bytes as a string; text returns it decoded according to the Content-Type charset. Always check is_success before proceeding, because a 403 or 503 does not throw.
Set max_redirects to follow login redirects and CDN hops; the default is 0, so redirects are not followed unless you ask. request_timeout is a wall-clock limit in seconds covering connection setup, sending the request, and receiving the whole response, and it resets for each followed redirect. That makes it critical for slow targets that stall rather than reset. The connect_timeout option caps the connection phase separately.
use strict;
use warnings;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new(
    max_redirects   => 5,
    request_timeout => 30,
    connect_timeout => 10,
);
# Mimic a real browser User-Agent to avoid trivial blocks
$ua->transactor->name(
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    . '(KHTML, like Gecko) Chrome/124.0 Safari/537.36'
);
my $url = 'https://books.toscrape.com/catalogue/page-1.html';
my $tx  = $ua->get($url);
my $res = $tx->result;
die "HTTP " . $res->code . " fetching $url\n" unless $res->is_success;
my $html = $res->body;
printf "Fetched %d bytes from %s\n", length($html), $url;
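To see the difference between body, text, and is_success without touching the network, the sketch below builds a response object by hand. The 403 status and the five-byte body are made up for illustration; in a real script the object would come from $tx->result.

```perl
use strict;
use warnings;
use Mojo::Message::Response;

# Hand-built response standing in for what $tx->result would return
my $res = Mojo::Message::Response->new;
$res->code(403);
$res->headers->content_type('text/html; charset=UTF-8');
$res->body("caf\xc3\xa9");    # "cafe" with an acute accent, as UTF-8 bytes

my $ok       = $res->is_success ? 1 : 0;    # 403 is not a 2xx status
my $byte_len = length $res->body;           # raw bytes
my $char_len = length $res->text;           # characters after charset decoding

printf "success=%d bytes=%d chars=%d\n", $ok, $byte_len, $char_len;
```

The byte and character counts differ because the accented letter occupies two bytes in UTF-8, which is why length on body and length on text are not interchangeable.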
3. Parsing HTML with Mojo::DOM
Mojo::DOM is a CSS-selector-based parser modelled loosely on jQuery's API. at() returns the first matching element or undef, so always check the result before calling a method on it: ->text on undef dies with "Can't call method" rather than merely warning. find() returns a Mojo::Collection you can chain with each, map, and grep.
The text method returns only the element's own text and, in current Mojolicious releases, leaves whitespace untouched, so trim it yourself (for example with Mojo::Util's trim). all_text includes text from descendant nodes. attr() fetches element attributes. For attribute values that may be absent, use attr('href') // '' rather than letting undef propagate into your data structure.
use strict;
use warnings;
use JSON::PP;
use Mojo::DOM;
# $html comes from the fetch step above
my $dom = Mojo::DOM->new($html);
my @books;
for my $card ($dom->find('article.product_pod')->each) {
    my $title_tag = $card->at('h3 a');
    my $price_tag = $card->at('.price_color');
    my $stock_tag = $card->at('.instock');
    push @books, {
        title    => $title_tag ? $title_tag->attr('title') : undef,
        price    => $price_tag ? $price_tag->text          : undef,
        in_stock => $stock_tag
            ? index($stock_tag->text, 'In stock') >= 0
            : 0,
    };
}
printf "Found %d books\n", scalar @books;
print JSON::PP->new->pretty->encode($books[0]);
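The difference between text, all_text, and a guarded at() is easiest to see on a small inline fragment. The markup below is made up for illustration, and trim comes from Mojo::Util; treat this as a sketch rather than a complete extraction routine.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(trim);

# Made-up markup for illustration; not copied from any real site
my $dom = Mojo::DOM->new(<<'HTML');
<article class="product_pod">
  <h3><a href="/book/1" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="instock"><i class="icon-ok"></i> In stock <b>(22 available)</b></p>
</article>
HTML

my $card  = $dom->at('article.product_pod');
my $stock = $card->at('.instock');

my $own_text = trim($stock->text);        # text of the <p> itself only
my $all_text = trim($stock->all_text);    # also includes the nested <b>

my $link = $card->at('h3 a');
my $href = $link ? ($link->attr('href') // '') : '';

my $rating = $card->at('.star-rating');   # no such element here, so undef
my $stars  = $rating ? $rating->attr('class') : '(missing)';

print join(' | ', $own_text, $all_text, $href, $stars), "\n";
```

Trimming explicitly keeps the result stable regardless of how much whitespace the source markup carries around its text nodes.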
4. Retiring LWP and Regex-Based Parsers
LWP::UserAgent with an outdated or carelessly configured TLS stack fails on sites that require SNI or modern cipher suites, and it has no HTTP/2 support at all. Mojo::UserAgent does not speak HTTP/2 either, so switching clients does not help with HTTP/2-only endpoints. HTML::TreeBuilder and HTML::Parser work, but they require more boilerplate than Mojo::DOM for the same result. Regex-based HTML parsing is fragile by definition: attribute order, whitespace, and encoding variations all cause silent failures in production.
The migration path is straightforward: swap LWP::UserAgent for Mojo::UserAgent (the API surface is similar enough for simple GET/POST scripts), replace HTML::TreeBuilder traversal with Mojo::DOM CSS selectors, and route bot-protected targets through OmniScrape. This extends the useful life of Perl glue scripts without a full rewrite. For background on what the API handles on your behalf, see the Cloudflare bypass guide.
If you cannot touch a legacy script this quarter, wrap it: let OmniScrape fetch the HTML and write it to a temp file, then feed that file to the existing LWP/regex pipeline unchanged. It buys time without a rewrite.
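5. Calling OmniScrape from Mojo::UserAgent
Mojo::UserAgent's json => shortcut serializes a Perl hashref to JSON and sets Content-Type: application/json automatically. The response is decoded back to a Perl data structure via ->result->json. Set request_timeout to at least 120 seconds, since js_rendering jobs can take 30–60 seconds on complex pages. Raise inactivity_timeout as well: it defaults to 40 seconds in current releases and closes the connection if no data arrives in that window, regardless of request_timeout.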
The response HTML lives at body->{data}{content}, not data.html. Check body->{success} before accessing it. The metadata.method_used field tells you whether OmniScrape used its fast HTTP path or a headless browser, which is useful for cost tracking and debugging.
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::DOM;
# Raise inactivity_timeout too; its 40-second default can cut off slow renders
my $ua = Mojo::UserAgent->new(request_timeout => 120, inactivity_timeout => 120);
my $tx = $ua->post(
    'https://api.omniscrape.io/v1/scrape' => {
        'X-API-Key' => $ENV{OMNISCRAPE_KEY},
    } => json => {    # json => also sets Content-Type: application/json
        url           => 'https://protected-shop.com/sku/771',
        mode          => 'auto',
        output_format => 'html',
        enable_solver => \1,    # JSON true
        proxy         => 'residential:us',
    }
);
my $res = $tx->result;
die "API HTTP " . $res->code . "\n" unless $res->is_success;
my $body = $res->json;
die "Scrape failed: " . ($body->{error} // 'unknown') . "\n"
    unless $body->{success};
my $html    = $body->{data}{content};
my $method  = $body->{metadata}{method_used};
my $charged = $body->{billing}{charged};
my $dom       = Mojo::DOM->new($html);
my $price_tag = $dom->at('.product-price');
my $price     = $price_tag ? $price_tag->text : 'NOT FOUND';
# Escape the dollar sign: an unescaped "$%" interpolates Perl's $% variable
printf "Price: %s (via %s, cost \$%.4f)\n", $price, $method, $charged;
6. Non-Blocking Concurrent Fetches with Mojo::IOLoop
Mojo::UserAgent is fully non-blocking when used inside a Mojo::IOLoop. Passing a callback as the last argument to post() or get() switches the request to async mode. The event loop drives all requests concurrently in a single thread — no fork, no threads, no shared-memory complexity.
Cap parallelism deliberately. Ten simultaneous js_rendering jobs multiply cost linearly and can exhaust your API concurrency quota. Five concurrent requests is a reasonable default for most batch jobs; tune based on your OmniScrape plan limits. Track $active carefully — decrement it in every callback branch, including error paths, or the loop stalls.
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop;
my @urls = qw(
    https://example.com/product/1
    https://example.com/product/2
    https://example.com/product/3
    https://example.com/product/4
    https://example.com/product/5
    https://example.com/product/6
);
my $total = @urls;    # remember the batch size; @urls is drained as we go
# Raise inactivity_timeout too; its 40-second default can cut off slow renders
my $ua = Mojo::UserAgent->new(request_timeout => 120, inactivity_timeout => 120);
my $active = 0;
my $max    = 5;
my @results;
sub scrape_next {
    # Nothing queued and nothing in flight: the batch is done
    unless (@urls || $active) {
        Mojo::IOLoop->stop;
        return;
    }
    return if $active >= $max || !@urls;
    my $url = shift @urls;
    $active++;
    $ua->post(
        'https://api.omniscrape.io/v1/scrape' => {
            'X-API-Key' => $ENV{OMNISCRAPE_KEY},
        } => json => {
            url           => $url,
            mode          => 'auto',
            output_format => 'html',
            enable_solver => \1,
        } => sub {
            my ($ua, $tx) = @_;
            $active--;
            # result dies on connection errors; json returns undef on bad JSON
            my $body = eval { $tx->result->json };
            if (ref $body eq 'HASH' && $body->{success}) {
                my $html = $body->{data}{content} // '';
                # Parse with Mojo::DOM here
                push @results, { url => $url, length => length($html) };
                printf "OK %s (%d bytes)\n", $url, length($html);
            }
            else {
                warn "FAIL $url\n";
            }
            scrape_next();
        }
    );
    # Kick off another request immediately if slots remain
    scrape_next() if $active < $max && @urls;
}
Mojo::IOLoop->next_tick(\&scrape_next);
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
# Report against the original batch size so failed URLs show up in the count
printf "Completed %d/%d URLs\n", scalar @results, $total;
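If hand-tracking $active feels error-prone, Mojo::Promise->map with its concurrency option is an alternative way to cap parallelism. The sketch below is not the approach used above: it swaps the real API call for a timer-based stand-in (the hypothetical fake_scrape_p) so the cap can be observed offline. In a real script that stand-in would be replaced by a $ua->post_p(...) call.

```perl
use strict;
use warnings;
use Mojo::Promise;

my @urls = map { 'https://example.com/product/' . $_ } 1 .. 6;
my ($active, $peak) = (0, 0);
my @done;

# Hypothetical stand-in for $ua->post_p(...): resolves with the URL after a short delay
sub fake_scrape_p {
    my $url = shift;
    $active++;
    $peak = $active if $active > $peak;
    return Mojo::Promise->timer(0.05 => $url)->then(sub {
        $active--;
        return shift;
    });
}

# At most two jobs are in flight at any moment
Mojo::Promise->map({concurrency => 2}, sub { fake_scrape_p($_) }, @urls)
    ->then(sub { @done = map { $_->[0] } @_ })
    ->catch(sub { warn "Batch failed: @_\n" })
    ->wait;

printf "Finished %d jobs, peak concurrency %d\n", scalar @done, $peak;
```

One difference in behavior to be aware of: with map, a single rejected promise rejects the whole batch, so per-URL failures should be caught inside the mapped callback if the batch is meant to continue.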
7. Structured Extraction Without DOM Parsing
When a cron script only needs three fields, css_extractor mode lets OmniScrape apply CSS selectors server-side and return a structured JSON object. You skip Mojo::DOM entirely and read directly from body->{data}{css_extracted}. This reduces response payload size and simplifies the Perl code.
css_selectors is a flat key-value map: keys are arbitrary field names you choose, values are CSS selector strings. The API returns the text content of the first matching element for each selector. If a selector matches nothing, the key is absent from css_extracted, so supply a default with // when reading (for example $fields->{price} // '(missing)').
use strict;
use warnings;
use Mojo::UserAgent;
# Raise inactivity_timeout too; its 40-second default can cut off slow renders
my $ua = Mojo::UserAgent->new(request_timeout => 120, inactivity_timeout => 120);
my $tx = $ua->post(
    'https://api.omniscrape.io/v1/scrape' => {
        'X-API-Key' => $ENV{OMNISCRAPE_KEY},
    } => json => {
        url           => 'https://example.com/product/771',
        mode          => 'auto',
        output_format => 'css_extractor',
        enable_solver => \1,
        css_selectors => {
            title       => 'h1',
            price       => '.price-now',
            sku         => '[data-sku]',
            rating      => '.star-rating',
            description => 'meta[name="description"]',
        },
    }
);
my $res = $tx->result;
die "API HTTP " . $res->code . "\n" unless $res->is_success;
my $body = $res->json;
die "Failed\n" unless $body->{success};
my $fields = $body->{data}{css_extracted};
printf "Title: %s\n", $fields->{title} // '(missing)';
printf "Price: %s\n", $fields->{price} // '(missing)';
printf "SKU: %s\n",   $fields->{sku}   // '(missing)';
8. JavaScript-Rendered SPAs Require js_rendering Mode
Mojo::UserAgent fetches the raw HTTP response — it does not execute JavaScript. React, Vue, and Angular applications that populate the DOM after page load return an empty shell to a plain HTTP client. For these targets, use mode: js_rendering in the OmniScrape request. The API runs a headless Chromium instance, executes the page JavaScript, and returns the fully rendered HTML.
Pair js_rendering with js_wait_selector to tell the browser to wait until a specific element appears before capturing the DOM. Without it, the snapshot may be taken before async data has loaded. See the scraping JavaScript-rendered pages guide for selector strategies and timeout tuning.
# $ua is the Mojo::UserAgent built in the earlier examples
my $tx = $ua->post(
    'https://api.omniscrape.io/v1/scrape' => {
        'X-API-Key' => $ENV{OMNISCRAPE_KEY},
    } => json => {
        url              => 'https://spa-example.com/products',
        mode             => 'js_rendering',
        output_format    => 'html',
        js_wait_selector => '.product-grid',    # wait for this to appear
        js_wait_timeout  => 15,                 # seconds
        enable_solver    => \1,
    }
);
9. Unicode Handling, Cron Hygiene, and Secret Management
Perl scraping scripts run unattended for years. Small oversights compound into 3 a.m. pages. Apply these rules at the start of every script and review them when onboarding a new target site.
- Add 'use utf8;' and 'use open qw(:std :utf8);' at the top of every script — this ensures source code and I/O use UTF-8 consistently without per-filehandle binmode calls
- Store OMNISCRAPE_KEY in the environment (via crontab's OMNISCRAPE_KEY=... line or a secrets manager) — never hardcode it in the script or commit it to version control
- Log billing.charged and billing.balance_after per run to a structured log file — this gives finance a per-job cost breakdown and alerts you when a pipeline suddenly gets expensive
- Chunk large URL lists and write progress to a state file between cron ticks — if the job dies at URL 4,000 of 10,000, resume from the checkpoint rather than restarting from zero
- Rotate User-Agent strings and add realistic request headers (Accept, Accept-Language, Referer) when calling targets directly — OmniScrape handles this automatically when you route through the API
- Set a hard wall-clock timeout in the cron entry itself (e.g., 'timeout 300 perl scrape.pl') so a stalled script does not block the next scheduled run
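The checkpoint rule in the list above can be sketched with core Perl alone. The helper names (load_offset, save_offset, run_chunk) and the one-number state file format are assumptions made for this illustration; the handler passed in is where the real fetch would go.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helpers: the state file holds the index of the next URL to process
sub load_offset {
    my $state_file = shift;
    open my $fh, '<', $state_file or return 0;    # no state yet: start from zero
    my $offset = <$fh> // 0;
    close $fh;
    chomp $offset;
    return $offset =~ /^\d+$/ ? $offset : 0;
}

sub save_offset {
    my ($state_file, $offset) = @_;
    my $tmp = "$state_file.tmp";
    open my $fh, '>', $tmp or die "Cannot write $tmp: $!\n";
    print {$fh} "$offset\n";
    close $fh or die "Cannot close $tmp: $!\n";
    # Rename over the old file so a crash never leaves a half-written checkpoint
    rename $tmp, $state_file or die "Cannot rename $tmp: $!\n";
}

sub run_chunk {
    my ($state_file, $urls, $chunk_size, $handler) = @_;
    my $start = load_offset($state_file);
    my $end   = $start + $chunk_size;
    $end = @$urls if $end > @$urls;
    for my $i ($start .. $end - 1) {
        $handler->($urls->[$i]);
        save_offset($state_file, $i + 1);    # checkpoint after every URL
    }
    return $end;
}

# Simulate three cron ticks over ten URLs, four per tick
my $state = tempdir(CLEANUP => 1) . '/scrape.state';
my @urls  = map { 'https://example.com/item/' . $_ } 1 .. 10;
my @seen;
my @ends = map { run_chunk($state, \@urls, 4, sub { push @seen, shift }) } 1 .. 3;
print "Offsets after each tick: @ends\n";
```

Checkpointing after every URL costs one small file write per item; for very large batches, saving once per chunk is a reasonable trade if reprocessing a few URLs after a crash is acceptable.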
10. Robust Error Handling for Long-Running Cron Jobs
Exit non-zero on hard failures so cron's MAILTO mechanism emails you. Distinguish between errors that require immediate human attention and those that are safe to retry automatically. Log the full response body on unexpected status codes — a truncated error message is rarely enough to diagnose the root cause.
- 401 Unauthorized — die immediately with a clear message; the API key is wrong or revoked and retrying will not help
- 402 Payment Required — die and disable the crontab entry; top up the account balance before resuming
- 429 Too Many Requests — sleep for 30–60 seconds and retry once; if the second attempt also 429s, die and let cron reschedule
- 502/503/504 Gateway errors — retry up to 3 times with exponential backoff (5s, 15s, 45s) and a small random jitter to avoid thundering-herd on shared infrastructure
- success: false with a non-HTTP error — warn and continue the batch; log the url and error field so you can investigate the specific target offline
- Mojo::UserAgent connection timeout — treat as a transient network failure; retry once, then skip and log; do not die unless the entire batch is failing
- Malformed JSON response (eval {} around ->json) — log the raw body and skip; this indicates an upstream proxy issue, not a bug in your selectors
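The status-code rules above can be condensed into a small retry policy. This is a sketch under stated assumptions: action_for_status, backoff_delay, and with_retries are hypothetical helper names, the request is any code ref that returns an HTTP status, and the caller chooses the retry budget (1 for a 429, 3 for gateway errors, following the list).

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);    # allows fractional-second sleeps for jitter

# Hypothetical helper: map an HTTP status to an action from the list above
sub action_for_status {
    my $code = shift;
    return 'die'   if $code == 401 || $code == 402;
    return 'retry' if $code == 429 || $code == 502 || $code == 503 || $code == 504;
    return 'ok'    if $code >= 200 && $code < 300;
    return 'log_and_skip';
}

# Delay before retry number $attempt (1-based): 5s, 15s, 45s, plus up to 1s of jitter
sub backoff_delay {
    my ($attempt, $jitter) = @_;
    $jitter //= rand(1);
    return 5 * 3**($attempt - 1) + $jitter;
}

# Call $request until it succeeds, fails fatally, or the retry budget runs out
sub with_retries {
    my ($request, $max_retries, $sleep) = @_;
    $sleep //= sub { sleep $_[0] };
    for my $attempt (0 .. $max_retries) {
        my $code   = $request->();
        my $action = action_for_status($code);
        return $code if $action eq 'ok';
        die "Fatal HTTP $code\n" if $action eq 'die';
        return $code if $action ne 'retry' || $attempt == $max_retries;
        $sleep->(backoff_delay($attempt + 1));
    }
}

# Simulated target: two 503s, then success; sleeping is stubbed out for the demo
my @responses = (503, 503, 200);
my @delays;
my $final = with_retries(sub { shift @responses }, 3, sub { push @delays, $_[0] });
printf "Final status %d after %d retries\n", $final, scalar @delays;
```

Passing the sleep routine in as a code ref keeps the policy testable without real waiting; a production script would simply omit that third argument.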
Frequently asked questions
Should I use Mojo::UserAgent or LWP::UserAgent for new Perl scraping projects?
Mojo::UserAgent for anything new. It handles keep-alive connections, JSON encoding/decoding, cookies, and non-blocking I/O out of the box, and modern TLS once IO::Socket::SSL is installed. Note that it speaks HTTP/1.1, not HTTP/2. LWP::UserAgent is appropriate only when you are patching a script you cannot otherwise touch, and even then, wrapping it with OmniScrape for protected targets is easier than fixing LWP's TLS configuration.
Mojo::DOM or HTML::TreeBuilder — which should I use?
Mojo::DOM for new code. CSS selectors are shorter and more readable than TreeBuilder's traversal API, and Mojo::DOM is maintained as part of Mojolicious. Use HTML::TreeBuilder only if you have a large existing codebase that depends on it heavily and the migration cost is not justified.
Do I need a full Mojolicious web application to use Mojo::UserAgent?
No. Mojo::UserAgent ships inside the Mojolicious distribution but works on its own. Import it with 'use Mojo::UserAgent;' in any Perl script, whether cron job, CLI tool, or daemon. A full Mojolicious app makes sense only if you need an admin interface, a webhook receiver, or a REST API on top of your scraping pipeline.
Is Perl still a viable choice for web scraping in production?
For maintaining existing pipelines, yes — particularly when OmniScrape handles modern bot protection externally, so the Perl script only needs to parse clean HTML. For greenfield projects where the team has a choice, most organizations pick Python (requests + BeautifulSoup/Playwright) or Go for easier hiring and broader library support. Perl is not a wrong choice; it is a narrower one.
Why not use WWW::Mechanize for scraping?
WWW::Mechanize simulates a browser from the pre-SPA era — it handles forms and links but has no JavaScript execution, struggles with modern TLS configurations, and does not support async I/O. For static sites it still works, but Mojo::UserAgent + OmniScrape covers the same ground and extends naturally to JavaScript-heavy and bot-protected targets.
How do I handle sites that require cookies or session state across requests?
Mojo::UserAgent maintains a cookie jar automatically across requests made by the same instance. For OmniScrape-routed requests, pass a session_id field in the request body — the API will reuse the same browser session and cookie state across calls that share the same session_id string.
What is the difference between mode auto and mode js_rendering?
mode auto tries a fast HTTP request first and escalates to a headless browser automatically if the response looks like a bot challenge or an empty SPA shell. mode js_rendering always uses a headless browser. Use auto as the default — it is cheaper when the fast path succeeds. Switch to js_rendering explicitly when you know the target always requires JavaScript execution and you want to avoid the auto-detection overhead.