1. Maven dependencies
HttpClient has been part of the Java standard library since Java 11, so it needs no extra jar. Add Jsoup for HTML parsing and Jackson for deserializing OmniScrape JSON responses. Pin exact versions in pom.xml so CI builds are reproducible across environments.
If you are on Gradle, the group and artifact IDs are identical; only the syntax differs. Jsoup 1.18+ handles malformed HTML from real-world sites gracefully, including unclosed tags and mismatched charsets. Jackson's databind module pulls in core and annotations transitively.
<!-- pom.xml -->
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.18.1</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.17.2</version>
</dependency>
2. Direct fetch with HttpClient
Build HttpClient once and reuse it; it manages connection pooling internally. Set an explicit connect timeout so your thread does not block indefinitely on a slow host. Use HttpResponse.BodyHandlers.ofString() for HTML; the charset is taken from the Content-Type header, with UTF-8 as the fallback when the header does not name one.
Jsoup.connect() is convenient for quick experiments but bypasses HttpClient's connection pool and timeout configuration. For anything beyond a one-off script, go through HttpClient so the behavior is consistent when you later swap the transport for OmniScrape JSON payloads.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Build once and share; the client pools connections internally.
HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(30))
    .followRedirects(HttpClient.Redirect.NORMAL)
    .build();

HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("https://books.toscrape.com/catalogue/page-1.html"))
    .header("User-Agent", "Mozilla/5.0 (compatible; MyBot/1.0)")
    .GET()
    .build();

// send() blocks and throws IOException or InterruptedException.
HttpResponse<String> response = client.send(
    request, HttpResponse.BodyHandlers.ofString());

if (response.statusCode() != 200) {
    throw new RuntimeException("Unexpected status: " + response.statusCode());
}

String html = response.body();
// html.length() counts UTF-16 chars, not bytes on the wire.
System.out.printf("Status %d, %d chars%n",
    response.statusCode(), html.length());
3. Extract rows with Jsoup
Jsoup.parse() accepts the raw HTML string and returns a Document. CSS selectors work much as they do in a browser DevTools console, so a selector copied from DevTools usually works unchanged. The text() method strips all child tags and returns the concatenated text.
Always null-check selectFirst() results. A missing .price_color element means the page layout changed or you landed on an error page — not that the product is free. Fail loudly with a descriptive exception rather than silently storing null data that corrupts your dataset downstream.
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The base URI lets Jsoup resolve relative links in the page.
Document doc = Jsoup.parse(html, "https://books.toscrape.com");
List<Book> books = new ArrayList<>();

for (Element card : doc.select("article.product_pod")) {
    Element titleEl = card.selectFirst("h3 a");
    Element priceEl = card.selectFirst(".price_color");
    Element stockEl = card.selectFirst(".instock");

    // selectFirst() returns null when nothing matches.
    if (titleEl == null || priceEl == null) {
        System.err.println("Unexpected card structure, skipping");
        continue;
    }

    String title = titleEl.attr("title");
    String price = priceEl.text();
    boolean inStock = stockEl != null && stockEl.text().contains("In stock");
    books.add(new Book(title, price, inStock));
}

books.stream().limit(3).forEach(System.out::println);

// Declare as a top-level or nested type.
record Book(String title, String price, boolean inStock) {}
4. Walk paginated catalogs
Increment the page counter until the server returns 404 or the product list is empty. Both conditions are valid termination signals — some sites return 200 with an empty body on the last page rather than 404.
Insert a polite delay between requests when hitting a site directly. Most public sites have rate limits; hammering them without delay is both inconsiderate and likely to get your IP blocked. When you route through OmniScrape, the API handles IP rotation for you and the delay can be reduced or removed.
// Reuses client and Book from the previous steps; Elements is org.jsoup.select.Elements.
List<Book> all = new ArrayList<>();
int page = 1;

while (true) {
    String url = "https://books.toscrape.com/catalogue/page-" + page + ".html";
    HttpResponse<String> r = client.send(
        HttpRequest.newBuilder()
            .uri(URI.create(url))
            .GET()
            .build(),
        HttpResponse.BodyHandlers.ofString());

    // Termination signal 1: the server says the page does not exist.
    if (r.statusCode() == 404) {
        System.out.println("Reached end of catalog at page " + page);
        break;
    }

    Document doc = Jsoup.parse(r.body(), url);
    Elements cards = doc.select("article.product_pod");
    // Termination signal 2: a 200 response with no product cards.
    if (cards.isEmpty()) break;

    for (Element card : cards) {
        Element titleEl = card.selectFirst("h3 a");
        Element priceEl = card.selectFirst(".price_color");
        if (titleEl == null || priceEl == null) continue;
        all.add(new Book(
            titleEl.attr("title"),
            priceEl.text(),
            card.selectFirst(".instock") != null));
    }

    System.out.printf("Page %d scraped, %d books total%n", page, all.size());
    page++;
    Thread.sleep(2_000); // polite delay between direct requests
}
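The two termination signals can be exercised without touching the network if the fetch step is hidden behind a small interface. The sketch below is an illustration under assumptions, not part of the original walkthrough: PageFetcher, crawl, and the maxPages safety cap are hypothetical names, and a real fetcher would wrap HttpClient.send() and rethrow IOException as an unchecked exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: the pagination loop with the fetch step abstracted away. */
class PaginationSketch {

    /** Returns the rows on a page, or Optional.empty() when the server answered 404. */
    interface PageFetcher {
        Optional<List<String>> fetch(int page);
    }

    /** Walks pages 1..maxPages and stops at the first 404 or the first page with no rows. */
    static List<String> crawl(PageFetcher fetcher, int maxPages, long delayMillis) {
        List<String> all = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            Optional<List<String>> rows = fetcher.fetch(page);
            // Signal 1: 404 (empty Optional). Signal 2: 200 with no product rows.
            if (rows.isEmpty() || rows.get().isEmpty()) {
                break;
            }
            all.addAll(rows.get());
            try {
                Thread.sleep(delayMillis); // polite delay between requests
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // Stub catalog: pages 1 to 3 hold two rows each, page 4 is a 404.
        PageFetcher stub = page -> page <= 3
            ? Optional.of(List.of("book-" + page + "a", "book-" + page + "b"))
            : Optional.empty();
        System.out.println(crawl(stub, 100, 0).size()); // prints 6
    }
}
```

The maxPages cap is an extra guard that the loop above does not have; it keeps a misbehaving site that never returns 404 or an empty page from producing an endless crawl.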
5. OmniScrape as a fetch service
When HttpClient gets a 403, a CAPTCHA page, or a Cloudflare challenge, POST the same URL to the OmniScrape API. The API handles IP rotation, browser fingerprinting, and challenge solving. Your Java code stays a thin HTTP client: it only changes the endpoint and adds an auth header.
Read the API key from an environment variable; never hardcode it. Set a generous timeout (90–120 seconds) because js_rendering mode spins up a real browser on the server side. Parse the response with Jackson and pass data.content to Jsoup, where your existing selectors work unchanged. See our Cloudflare bypass guide for what the API resolves at the network layer.
import java.util.Map;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
String apiKey = System.getenv("OMNISCRAPE_KEY");
// getenv() returns null when the variable is unset; stop before sending a bad header.
if (apiKey == null || apiKey.isBlank()) {
    throw new IllegalStateException("OMNISCRAPE_KEY is not set");
}

Map<String, Object> payload = Map.of(
    "url", "https://protected-shop.com/item/441",
    "mode", "auto",
    "output_format", "html",
    "enable_solver", true
);

HttpRequest omni = HttpRequest.newBuilder()
    .uri(URI.create("https://api.omniscrape.io/v1/scrape"))
    .header("Content-Type", "application/json")
    .header("X-API-Key", apiKey)
    .POST(HttpRequest.BodyPublishers.ofString(
        mapper.writeValueAsString(payload)))
    .timeout(Duration.ofSeconds(120))
    .build();

HttpResponse<String> res = client.send(
    omni, HttpResponse.BodyHandlers.ofString());

JsonNode root = mapper.readTree(res.body());
if (!root.path("success").asBoolean()) {
    throw new RuntimeException("OmniScrape error: " + res.body());
}

String html = root.path("data").path("content").asText();
String methodUsed = root.path("metadata").path("method_used").asText();
boolean solverUsed = root.path("metadata").path("solver_used").asBoolean();
double charged = root.path("billing").path("charged").asDouble();

Document doc = Jsoup.parse(html);
// Look the element up once, then null-check it.
Element priceEl = doc.selectFirst(".product-price");
String price = priceEl != null ? priceEl.text() : "not found";

System.out.printf("Price: %s | via %s | solver: %s | cost: $%.4f%n",
    price, methodUsed, solverUsed, charged);
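Deciding when to switch from a direct fetch to the API can be kept in one small predicate. The following is a rough sketch under assumptions: shouldUseFetchService and the marker strings are hypothetical, the markers are placeholders rather than a documented way to detect any particular challenge page, and they should be replaced with text you actually observe on blocked responses.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: decide whether a direct response should be retried through the API. */
class FallbackDecision {

    // Placeholder markers, not an authoritative list. A legitimate page that merely
    // mentions a CAPTCHA would match too, so tune these per target site.
    static final List<String> DEFAULT_MARKERS = List.of("captcha", "verify you are human");

    static boolean shouldUseFetchService(int status, String body, List<String> markers) {
        if (status == 403) {
            return true; // blocked outright
        }
        if (body == null || body.isBlank()) {
            return false; // nothing to inspect
        }
        String lower = body.toLowerCase(Locale.ROOT);
        // Case-insensitive substring match against each marker.
        return markers.stream()
            .anyMatch(marker -> lower.contains(marker.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(shouldUseFetchService(403, "", DEFAULT_MARKERS));
        System.out.println(shouldUseFetchService(200, "<h1>Books</h1>", DEFAULT_MARKERS));
    }
}
```

Keeping the check separate from the fetch code means the direct path and the API path can share the same Jsoup selectors afterwards, as described above.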
6. Skip Jsoup with css_extractor
For production pipelines that write directly to a database or message queue, parsing a full DOM in Java is unnecessary overhead. Set output_format to css_extractor and declare your selectors in the request body — OmniScrape evaluates them server-side and returns a flat key-value map in data.css_extracted.
This removes the Jsoup dependency from the hot path entirely. The response is already structured data ready for your repository layer. It also means selector logic lives in configuration rather than compiled code, making it easier to update when a site changes its markup without redeploying the service.
// Reuses client, mapper and apiKey from the previous step; targetUrl is the page to scrape.
Map<String, Object> selectors = Map.of(
    "title", "h1.product-name",
    "price", ".price-current",
    "sku", "[data-sku]",
    "rating", ".star-rating",
    "description", "#product-description p"
);

Map<String, Object> payload = Map.of(
    "url", targetUrl,
    "mode", "auto",
    "output_format", "css_extractor",
    "enable_solver", true,
    "css_selectors", selectors
);

HttpRequest req = HttpRequest.newBuilder()
    .uri(URI.create("https://api.omniscrape.io/v1/scrape"))
    .header("Content-Type", "application/json")
    .header("X-API-Key", apiKey)
    .POST(HttpRequest.BodyPublishers.ofString(
        mapper.writeValueAsString(payload)))
    .timeout(Duration.ofSeconds(120))
    .build();

HttpResponse<String> res = client.send(
    req, HttpResponse.BodyHandlers.ofString());

JsonNode root = mapper.readTree(res.body());
JsonNode extracted = root.path("data").path("css_extracted");

// extracted is a JSON object: { "title": "...", "price": "...", ... }
String title = extracted.path("title").asText();
String price = extracted.path("price").asText();
System.out.printf("Product: %s, %s%n", title, price);
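One caveat with reading fields through path(...).asText() is that a missing key comes back as an empty string instead of an error. If the extracted map is about to go straight into a repository layer, a small validation step keeps with the fail-loudly advice from the Jsoup section. This is a minimal sketch assuming the extracted values have already been converted to a Map<String, String>; Product, toProduct, and require are hypothetical names.

```java
import java.util.Map;

/** Hypothetical sketch: turn a flat extracted map into a typed record, failing on gaps. */
class ExtractedMapping {

    record Product(String title, String price) {}

    static Product toProduct(Map<String, String> extracted) {
        return new Product(require(extracted, "title"), require(extracted, "price"));
    }

    /** Returns the trimmed value, or throws when the key is absent or blank. */
    private static String require(Map<String, String> extracted, String key) {
        String value = extracted.get(key);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("css_extracted is missing '" + key + "'");
        }
        return value.strip();
    }

    public static void main(String[] args) {
        System.out.println(toProduct(Map.of("title", " Dune ", "price", "9.99")));
    }
}
```

A missing selector match then surfaces as an exception at the boundary, where the URL is still known, rather than as a blank column discovered later.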
7. Concurrent scrapes with virtual threads
Java 21 virtual threads (JEP 444) let you issue hundreds of blocking HttpClient calls without sizing a thread pool. Each virtual thread is cheap: the JVM parks it when it blocks on I/O and resumes it when data arrives. You get the readability of synchronous code with the throughput of async I/O.
Cap concurrency with a Semaphore to avoid flooding the OmniScrape API or the target site. Eight concurrent js_rendering jobs is a reasonable starting point; increase it based on your API plan and observed latency. Keep each task's Future and call get() on it inside a try-with-resources executor so failures in individual tasks are surfaced, not silently swallowed.
// Needs java.util.concurrent.{Executors, ExecutionException, Future, Semaphore}.
Semaphore sem = new Semaphore(8);
List<String> urls = loadUrlsFromDatabase(); // your source

// close() at the end of the try block waits for every submitted task.
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<Future<JsonNode>> futures = urls.stream()
        .map(url -> executor.submit(() -> {
            sem.acquire();
            try {
                return fetchStructured(client, mapper, apiKey, url);
            } finally {
                sem.release();
            }
        }))
        .toList();

    int success = 0, failed = 0;
    for (Future<JsonNode> f : futures) {
        try {
            JsonNode extracted = f.get().path("data").path("css_extracted");
            persist(extracted); // write to DB or queue
            success++;
        } catch (ExecutionException e) {
            System.err.println("Task failed: " + e.getCause().getMessage());
            failed++;
        }
    }
    System.out.printf("Done: %d succeeded, %d failed%n", success, failed);
}

private static JsonNode fetchStructured(
        HttpClient client, ObjectMapper mapper,
        String apiKey, String url) throws Exception {

    Map<String, Object> payload = Map.of(
        "url", url,
        "mode", "auto",
        "output_format", "css_extractor",
        "enable_solver", true,
        "css_selectors", Map.of("title", "h1", "price", ".price")
    );

    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("https://api.omniscrape.io/v1/scrape"))
        .header("Content-Type", "application/json")
        .header("X-API-Key", apiKey)
        .POST(HttpRequest.BodyPublishers.ofString(
            mapper.writeValueAsString(payload)))
        .timeout(Duration.ofSeconds(120))
        .build();

    HttpResponse<String> res = client.send(
        req, HttpResponse.BodyHandlers.ofString());

    JsonNode root = mapper.readTree(res.body());
    if (!root.path("success").asBoolean()) {
        throw new RuntimeException("Failed for " + url + ": " + res.body());
    }
    return root;
}
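To check that the Semaphore really bounds the number of jobs in flight, the same pattern can be run against simulated work. The sketch below is illustrative and requires Java 21; peakConcurrency is a hypothetical helper, and Thread.sleep stands in for the blocking client.send() call.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: measure how many simulated jobs run at once under a Semaphore cap. */
class SemaphoreCapDemo {

    /** Runs the given number of simulated jobs and returns the peak number in flight. */
    static int peakConcurrency(int tasks, int permits, long workMillis) {
        Semaphore sem = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    sem.acquire();
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max); // record the high-water mark
                        Thread.sleep(workMillis);              // stands in for a blocking request
                        return null;
                    } finally {
                        inFlight.decrementAndGet();
                        sem.release();
                    }
                });
            }
        } // close() blocks until every submitted task has finished

        return peak.get();
    }

    public static void main(String[] args) {
        int peak = peakConcurrency(50, 8, 20);
        System.out.println("Peak in-flight jobs: " + peak + " (cap was 8)");
    }
}
```

Fifty virtual threads are created immediately, but the counter never exceeds the permit count, which is the property that protects the API and the target site.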
8. JavaScript-rendered pages
Jsoup parses the HTML the server sends — it cannot execute JavaScript. React, Vue, and Angular apps typically ship an almost-empty HTML shell; the actual content is injected by the framework after scripts run. If view-source shows a nearly empty <div id='root'></div>, you need a real browser.
Use mode js_rendering and js_wait_selector to tell OmniScrape to wait until a specific element appears in the DOM before capturing the page. This is more reliable than a fixed wait time because it adapts to server latency. Combine it with css_extractor to get structured data back without parsing HTML in Java at all. More detail in scraping JavaScript-rendered pages.
// Reuses client, mapper and apiKey; LinkedHashMap is java.util.LinkedHashMap.
Map<String, Object> spaPayload = new LinkedHashMap<>();
spaPayload.put("url", "https://spa-store.com/products");
spaPayload.put("mode", "js_rendering");
spaPayload.put("output_format", "css_extractor");
spaPayload.put("js_wait_selector", ".product-card");
spaPayload.put("js_wait_timeout", 10_000);
spaPayload.put("css_selectors", Map.of(
    "name", ".product-card h2",
    "price", ".product-card .price",
    "sku", ".product-card [data-sku]"
));

HttpRequest req = HttpRequest.newBuilder()
    .uri(URI.create("https://api.omniscrape.io/v1/scrape"))
    .header("Content-Type", "application/json")
    .header("X-API-Key", apiKey)
    .POST(HttpRequest.BodyPublishers.ofString(
        mapper.writeValueAsString(spaPayload)))
    .timeout(Duration.ofSeconds(120))
    .build();

HttpResponse<String> res = client.send(
    req, HttpResponse.BodyHandlers.ofString());

JsonNode root = mapper.readTree(res.body());
System.out.println(root.path("data").path("css_extracted").toPrettyString());
9. Error handling in production
Map every HTTP status and API error condition to a concrete action in your service layer. Swallowing exceptions or logging and continuing produces corrupt datasets that are harder to debug than a clean failure. Use a dead-letter mechanism — a database table, a Kafka topic, or a file — so failed URLs can be retried or investigated without re-running the entire job.
For transient errors (429, 502, 503), use exponential backoff with jitter. Resilience4j's Retry module integrates cleanly with Java's functional interfaces and works well alongside virtual threads. For permanent errors (401, 402), alert immediately — retrying will not help.
- 200 + success:false — log the full response body to a dead-letter topic; do not retry automatically without inspecting the error message
- 401 — invalid or missing API key; fix configuration and redeploy; do not retry
- 402 — account balance exhausted; pause the scheduler and notify the team responsible for billing
- 429 — rate limit hit; apply exponential backoff with jitter using Resilience4j Retry; reduce concurrency if it recurs
- 502 / 503 — upstream transient error; retry up to three times with backoff before moving to dead-letter
- HttpTimeoutException — increase timeout for js_rendering targets or reduce concurrency; log the URL for investigation
- JsonProcessingException — the response was not valid JSON; log the raw body; this usually indicates a network interception or proxy issue
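The status-to-action table above can be written down as a small pure function, together with a hand-rolled backoff calculation. This is a plain-Java sketch, not Resilience4j; Action, actionFor, and backoff are hypothetical names, the "full jitter" variant is one common choice among several, and sending unlisted statuses to dead-letter is an assumption the list does not state.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: map responses to actions and compute a jittered retry delay. */
class RetryPolicySketch {

    enum Action { PROCEED, RETRY_WITH_BACKOFF, ALERT_NO_RETRY, DEAD_LETTER }

    /** Mirrors the list above; unlisted statuses go to dead-letter (an assumption). */
    static Action actionFor(int status, boolean success) {
        return switch (status) {
            case 200 -> success ? Action.PROCEED : Action.DEAD_LETTER;
            case 401, 402 -> Action.ALERT_NO_RETRY;
            case 429, 502, 503 -> Action.RETRY_WITH_BACKOFF;
            default -> Action.DEAD_LETTER;
        };
    }

    /**
     * Full jitter: a uniformly random delay between 0 and min(cap, base * 2^attempt).
     * The attempt index is zero-based; the shift is clamped to avoid overflow.
     */
    static Duration backoff(int attempt, Duration base, Duration cap) {
        int shift = Math.max(0, Math.min(attempt, 20));
        long ceiling = Math.min(cap.toMillis(), base.toMillis() * (1L << shift));
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(ceiling + 1));
    }

    public static void main(String[] args) {
        System.out.println(actionFor(429, false));
        for (int attempt = 0; attempt < 3; attempt++) {
            // Delays are random, so the printed values differ between runs.
            System.out.println(backoff(attempt, Duration.ofMillis(500), Duration.ofSeconds(30)));
        }
    }
}
```

The jitter spreads retries from many concurrent virtual threads over time, so a burst of 429 responses does not turn into a synchronized second burst.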
Frequently asked questions
Should I use Jsoup or HtmlUnit for Java web scraping?
Jsoup for parsing HTML you have already fetched — it is fast, stable, and has an excellent CSS selector API. HtmlUnit simulates a browser in-process but breaks frequently on modern JavaScript-heavy sites and is slow to update when browser APIs change. For pages that genuinely require JavaScript execution, use OmniScrape with mode js_rendering rather than maintaining HtmlUnit in production.
OkHttp or java.net.http.HttpClient — which should I use?
Both work correctly with OmniScrape. HttpClient is built into the JDK since Java 11 and is sufficient for most scraping workloads — no extra dependency, no version conflicts. OkHttp is a reasonable choice if your organisation has already standardised on it or if you need its interceptor chain for cross-cutting concerns like logging and retry. Do not add OkHttp just for scraping if you are starting fresh.
How do I integrate this into a Spring Boot application?
Register a @Bean that produces a configured HttpClient and inject it wherever you need it. Trigger scraping jobs from @Scheduled methods or Kafka consumers — not from MVC request handlers that users are waiting on. For reactive stacks using WebFlux, wrap the blocking HttpClient.send() call in Mono.fromCallable(...).subscribeOn(Schedulers.boundedElastic()) or use virtual threads with a dedicated executor.
How do I handle encoding issues with international product names?
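Jsoup detects the charset from the Content-Type header and meta charset tags when it reads raw bytes, as with Jsoup.connect() or Jsoup.parse(InputStream, ...). When you fetch with HttpClient and BodyHandlers.ofString(), the body has already been decoded by the time Jsoup sees it, using the Content-Type charset or UTF-8 if none is given. The most common source of corrupted characters is writing bytes to a file or JDBC connection with the wrong charset. Explicitly set UTF-8 everywhere: Files.writeString(path, content, StandardCharsets.UTF_8), and, where your driver supports it (MySQL Connector/J does), configure your JDBC URL with characterEncoding=UTF-8. If names are still garbled, inspect the raw bytes before Jsoup parsing to confirm the source encoding.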
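A quick way to recognise this class of bug is to reproduce it on purpose. The sketch below is illustrative only: decode and fileRoundTrip are hypothetical helpers, and the product name is written with Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: show how a wrong charset garbles text and how UTF-8 preserves it. */
class EncodingCheck {

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /** Writes and re-reads the text through a temp file with UTF-8 set explicitly on both sides. */
    static String fileRoundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("names", ".txt");
            try {
                Files.writeString(tmp, content, StandardCharsets.UTF_8);
                return Files.readString(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "Caf\u00e9 M\u00fcller"; // "Cafe Muller" with e-acute and u-umlaut
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);

        // Correct charset: the original text comes back.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(name));
        // Wrong charset: each two-byte character turns into two Latin-1 characters.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(name));
        System.out.println(fileRoundTrip(name).equals(name));
    }
}
```

If stored names show pairs of accented Latin characters where a single accented letter should be, the bytes were most likely valid UTF-8 that was decoded as a single-byte charset somewhere downstream of the scraper.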
When should I use mode auto versus js_rendering?
Start with mode auto — it attempts a fast HTTP fetch first and escalates to a headless browser automatically if the response indicates JavaScript is required. Use js_rendering explicitly only when you know the target always needs a browser and you want to skip the fast-lane attempt. Avoid specifying fast unless you are certain the page is static and you want to prevent any browser escalation. mode auto is the right default for most production workloads.
How do I reuse sessions across multiple requests to the same site?
Pass a session_id string in the OmniScrape request payload. Requests sharing the same session_id reuse the same browser context on the server side, which preserves cookies, login state, and local storage. Generate a UUID per scraping job or per user session in your application and include it consistently across all requests in that job.
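As a rough illustration, the payloads for one job can be built so that they all carry the same generated identifier. This is a sketch under assumptions: payloadsForJob is a hypothetical helper, and the mode and output_format values are simply the ones used earlier in this guide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch: one session_id per scraping job, shared by every request in it. */
class SessionPayloads {

    static List<Map<String, Object>> payloadsForJob(List<String> urls) {
        // Generated once per job so every request reuses the same server-side context.
        String sessionId = UUID.randomUUID().toString();
        return urls.stream()
            .map(url -> {
                Map<String, Object> payload = new LinkedHashMap<>();
                payload.put("url", url);
                payload.put("mode", "auto");
                payload.put("output_format", "html");
                payload.put("session_id", sessionId);
                return payload;
            })
            .toList();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> job = payloadsForJob(
            List.of("https://example.com/login", "https://example.com/account"));
        // Both payloads print the same session_id value.
        job.forEach(payload -> System.out.println(payload.get("session_id")));
    }
}
```

Each map can then be serialized with mapper.writeValueAsString(...) and posted exactly like the payloads in the earlier sections.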
What is the right concurrency level for virtual threads with OmniScrape?
Virtual threads remove the JVM-side bottleneck — you can have thousands without running out of OS threads. The real constraints are your OmniScrape API plan's concurrency limit and the target site's rate limits. Start with a Semaphore capped at 8 for js_rendering jobs and 20–30 for fast or auto jobs. Monitor billing.charged in responses and adjust based on observed throughput and error rates rather than guessing upfront.