1. NuGet Packages and Project Setup
AngleSharp is the recommended HTML parser for modern .NET. It implements the WHATWG HTML parsing specification, so it handles malformed markup the same way browsers do. System.Text.Json has shipped in the shared framework since .NET Core 3.0 and is sufficient for deserializing OmniScrape responses; there is no need to add Newtonsoft.Json unless your project already depends on it.
For projects targeting .NET 6 or later, both HttpClient and System.Text.Json are available without additional packages. Add only AngleSharp explicitly.
dotnet add package AngleSharp
# HttpClient and System.Text.Json are included in the BCL (.NET 6+)
# Optional: add Polly for retry policies
dotnet add package Microsoft.Extensions.Http.Polly
2. Fetching Pages with HttpClient
For throwaway console tools or one-off scripts, a single static HttpClient instance shared across the process lifetime is acceptable. Set a realistic Timeout — the default 100-second timeout is too long for batch jobs and too short for slow CDN-backed pages. Add a User-Agent header; many servers return 403 to requests that omit it.
For ASP.NET Core services, hosted workers, or anything that runs longer than a single process invocation, skip this pattern entirely and use IHttpClientFactory covered in the next section. The factory manages handler lifetimes and avoids DNS staleness.
using System.Net.Http;

// Declare once at the class or program level, never inside a loop
private static readonly HttpClient _client = new HttpClient
{
    Timeout = TimeSpan.FromSeconds(30),
    DefaultRequestHeaders =
    {
        { "User-Agent", "Mozilla/5.0 (compatible; MyBot/1.0)" },
    },
};

var html = await _client.GetStringAsync(
    "https://books.toscrape.com/catalogue/page-1.html");
Console.WriteLine($"Fetched {html.Length:N0} characters");
3. Parsing HTML with AngleSharp
BrowsingContext.New creates a parsing environment. Pass the raw HTML string via req.Content() inside OpenAsync — this avoids a second network call. QuerySelectorAll accepts any CSS selector string and returns an IHtmlCollection you can project with LINQ. Always call .Trim() on TextContent before persisting; whitespace around prices and titles is common.
GetAttribute returns null when the attribute is absent, so use the null-conditional operator throughout. If a selector changes on the target site, you get null values rather than an exception — log nulls explicitly so silent data gaps surface in monitoring.
using AngleSharp;
using AngleSharp.Dom;

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

var books = document.QuerySelectorAll("article.product_pod")
    .Select(card => new
    {
        Title = card.QuerySelector("h3 a")?.GetAttribute("title")?.Trim(),
        Price = card.QuerySelector(".price_color")?.TextContent.Trim(),
        Rating = card.QuerySelector("p.star-rating")?.ClassName
            ?.Replace("star-rating", "").Trim(),
    })
    .Where(b => b.Title is not null)
    .ToList();

Console.WriteLine($"Parsed {books.Count} books");
foreach (var book in books.Take(3))
    Console.WriteLine($"{book.Title} - {book.Price} ({book.Rating} stars)");
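The Price and Rating values parsed above are still display strings. A minimal sketch of normalizing them before persisting, assuming prices arrive as a currency symbol followed by a dot-decimal number (for example "£51.77") and that the rating class leaves a number word such as "Three". `ParsePrice` and `ParseRating` are hypothetical helper names, not AngleSharp APIs; both return null instead of throwing so that gaps can be counted and logged.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: keeps ASCII digits and the decimal point, then parses
// with the invariant culture. Returns null when nothing usable remains.
static decimal? ParsePrice(string? text)
{
    if (string.IsNullOrWhiteSpace(text)) return null;
    var numeric = new string(text.Where(c => c is (>= '0' and <= '9') or '.').ToArray());
    return decimal.TryParse(numeric, NumberStyles.AllowDecimalPoint,
        CultureInfo.InvariantCulture, out var value) ? value : null;
}

// Hypothetical helper: maps a number word ("One" to "Five") to 1..5.
// Assumption: the site encodes the rating as an English number word.
static int? ParseRating(string? word) => word?.Trim().ToLowerInvariant() switch
{
    "one" => 1,
    "two" => 2,
    "three" => 3,
    "four" => 4,
    "five" => 5,
    _ => null,
};

Console.WriteLine(ParsePrice("£51.77") is not null);   // a price was recognised
Console.WriteLine(ParseRating(" Three ") is not null); // a rating was recognised
```

Keeping the raw string alongside the parsed value is a reasonable choice here: when a parse returns null, the original text is what you will want in the log.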
4. IHttpClientFactory in ASP.NET Core
Register a typed client in Program.cs. The factory pools HttpMessageHandler instances and rotates them on a configurable interval (default two minutes), preventing both socket exhaustion and DNS staleness. Read the API key from configuration: in development use dotnet user-secrets, in production use Azure Key Vault via the Azure.Extensions.AspNetCore.Configuration.Secrets provider (the older Microsoft.Extensions.Configuration.AzureKeyVault package is deprecated).
Inject OmniScrapeClient into a BackgroundService or a Hangfire job, not directly into MVC controllers handling user requests. Scraping is slow and should never block a request thread. The typed client below reads body.Data.Content from the success response, which is where OmniScrape returns the fetched HTML.
// Program.cs
builder.Services.AddHttpClient<OmniScrapeClient>(client =>
{
    client.BaseAddress = new Uri("https://api.omniscrape.io/");
    client.Timeout = TimeSpan.FromMinutes(2);
    client.DefaultRequestHeaders.Add(
        "X-API-Key",
        // Fail fast at startup if the key is not configured
        builder.Configuration["OmniScrape:ApiKey"]
            ?? throw new InvalidOperationException("OmniScrape:ApiKey is not configured"));
});

// OmniScrapeClient.cs
using System.Net.Http.Json;

public class OmniScrapeClient(HttpClient http)
{
    public async Task<string> FetchHtmlAsync(
        string url,
        string mode = "auto",
        CancellationToken ct = default)
    {
        var payload = new
        {
            url,
            mode,
            output_format = "html",
            enable_solver = true,
        };

        using var response = await http.PostAsJsonAsync("v1/scrape", payload, ct);
        response.EnsureSuccessStatusCode();

        var body = await response.Content
            .ReadFromJsonAsync<ScrapeResponse>(cancellationToken: ct)
            ?? throw new InvalidOperationException("Null response body");

        // ScrapeFailedException is your own exception type (not shown)
        if (!body.Success)
            throw new ScrapeFailedException(url, body.Error);

        // HTML content is in data.content
        return body.Data.Content;
    }
}

// ScrapeResponse.cs (records for System.Text.Json)
using System.Text.Json.Serialization;

public record ScrapeResponse(
    bool Success,
    ScrapeData Data,
    ScrapeMetadata Metadata,
    string? Error);

// snake_case fields need explicit names; the default web options only
// match property names case-insensitively
public record ScrapeData(
    string Content,
    [property: JsonPropertyName("css_extracted")] Dictionary<string, string>? CssExtracted);

public record ScrapeMetadata(
    [property: JsonPropertyName("method_used")] string MethodUsed,
    [property: JsonPropertyName("solver_used")] bool SolverUsed,
    [property: JsonPropertyName("challenge_solved")] bool ChallengeSolved);
5. Handling Bot-Protected Pages
Finance portals, retail sites, and travel aggregators commonly return 403 responses or serve JavaScript challenge pages to datacenter IP ranges. AngleSharp will parse the challenge page without error — your selectors simply return null, and you silently collect no data. Detect this by checking for known challenge fingerprints ("Just a moment", "cf-browser-verification") in the returned HTML before parsing.
For domains that consistently block direct HTTP, route requests through OmniScrapeClient with enable_solver: true. The OmniScrape Web Unlocker handles TLS fingerprinting, JavaScript challenge execution, and cookie management transparently. Read Cloudflare bypass for a detailed breakdown of the protection stack. The mode "auto" will attempt fast HTTP first and escalate to a headless browser automatically when a challenge is detected — no code change needed per domain.
// Detect challenge pages before parsing
private static bool IsChallengeResponse(string html) =>
    html.Contains("cf-browser-verification", StringComparison.OrdinalIgnoreCase) ||
    html.Contains("Just a moment", StringComparison.OrdinalIgnoreCase) ||
    html.Length < 5_000; // suspiciously small for a product page

// GetStringAsync throws on a 403, so inspect the response instead
using var response = await _client.GetAsync(targetUrl, ct);
var html = await response.Content.ReadAsStringAsync(ct);
if (!response.IsSuccessStatusCode || IsChallengeResponse(html))
{
    // Fall back to OmniScrape with solver enabled
    html = await omni.FetchHtmlAsync(targetUrl, mode: "auto", ct: ct);
}
6. Server-Side CSS Extraction with css_extractor
When you need a small set of fields from a page, use output_format: css_extractor and pass a css_selectors dictionary. OmniScrape evaluates the selectors server-side and returns a flat key-value map in data.css_extracted. This eliminates the AngleSharp parsing step entirely for simple cases and reduces the amount of HTML you need to transfer and process.
Map the extracted dictionary directly to a DTO or record. If a selector fails to match, the key is absent from the dictionary, so read values with TryGetValue or GetValueOrDefault rather than direct indexing to avoid a KeyNotFoundException when a site changes its markup.
var payload = new
{
    url = "https://protected-shop.com/sku/441",
    mode = "auto",
    output_format = "css_extractor",
    enable_solver = true,
    css_selectors = new Dictionary<string, string>
    {
        ["title"] = "h1.product-name",
        ["price"] = "span.price-current",
        ["availability"] = ".stock-status",
        ["sku"] = "meta[name='sku']@content",
    },
};

using var response = await http.PostAsJsonAsync("v1/scrape", payload, ct);
response.EnsureSuccessStatusCode();

var json = await response.Content
    .ReadFromJsonAsync<ScrapeResponse>(cancellationToken: ct);

// CssExtracted must be mapped to css_extracted (JsonPropertyName or a
// snake_case naming policy), otherwise it deserializes as null
var extracted = json!.Data.CssExtracted ?? new Dictionary<string, string>();

// ProductDto is your own record type (not shown)
var product = new ProductDto(
    Title: extracted.GetValueOrDefault("title", ""),
    Price: extracted.GetValueOrDefault("price", ""),
    Availability: extracted.GetValueOrDefault("availability", "unknown"),
    Sku: extracted.GetValueOrDefault("sku", "")
);
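Defaulting every missing key to an empty string hides selector drift. One way to surface it, sketched here with a hypothetical `MissingKeys` helper: the key names mirror the request above, and the sample dictionary merely stands in for data.css_extracted with made-up values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the expected keys that are absent or blank,
// so the caller can log them or fall back to full HTML parsing.
static IReadOnlyList<string> MissingKeys(
    IReadOnlyDictionary<string, string> extracted,
    params string[] expected) =>
    expected
        .Where(key => !extracted.TryGetValue(key, out var value)
                      || string.IsNullOrWhiteSpace(value))
        .ToList();

// Sample data standing in for data.css_extracted (values are made up).
var sample = new Dictionary<string, string>
{
    ["title"] = "Example Laptop",
    ["price"] = "$999.00",
};

var missing = MissingKeys(sample, "title", "price", "availability", "sku");
if (missing.Count > 0)
    Console.WriteLine($"Selectors with no match: {string.Join(", ", missing)}");
// prints: Selectors with no match: availability, sku
```

Running this check before building the DTO keeps the defaulting logic simple while still giving monitoring something to alert on.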
7. Async Batch Scraping with SemaphoreSlim
Task.WhenAll does not throttle anything: each task starts as soon as it is created, and WhenAll only waits for them to finish. Without a concurrency cap, a list of 500 URLs will open 500 simultaneous connections, saturating your API quota and triggering rate limiting. SemaphoreSlim(n) limits in-flight requests to n at a time. Five is a reasonable starting point for OmniScrape; increase it based on your plan's rate limits.
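Always await all the way through the call chain. Calling .Result or .GetAwaiter().GetResult() can deadlock in classic ASP.NET and UI applications, which have a synchronization context; in ASP.NET Core it blocks thread-pool threads and can starve the pool under load. Pass the IHostApplicationLifetime.ApplicationStopping token (or the stoppingToken given to BackgroundService.ExecuteAsync) so in-progress batches drain cleanly on shutdown.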
var urls = new[]
{
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
    // ...
};

using var sem = new SemaphoreSlim(initialCount: 5, maxCount: 5);

var tasks = urls.Select(async url =>
{
    await sem.WaitAsync(ct);
    try
    {
        var html = await omni.FetchHtmlAsync(url, ct: ct);
        return (url, html, error: (string?)null);
    }
    // Let cancellation propagate instead of recording it as a failed URL
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        logger.LogWarning(ex, "Failed to fetch {Url}", url);
        return (url, html: (string?)null, error: ex.Message);
    }
    finally
    {
        sem.Release();
    }
});

var results = await Task.WhenAll(tasks);
var succeeded = results.Where(r => r.html is not null).ToList();
var failed = results.Where(r => r.error is not null).ToList();

logger.LogInformation(
    "Batch complete: {Succeeded} succeeded, {Failed} failed",
    succeeded.Count, failed.Count);
8. JavaScript-Rendered Pages and SPAs
AngleSharp parses static HTML — it does not execute JavaScript. React, Vue, and Blazor WebAssembly storefronts render their content client-side, so the raw HTML response contains only a shell div and script tags. Selectors against that HTML return null for every field.
Use mode: js_rendering to instruct OmniScrape to load the page in a headless Chromium instance. Set js_wait_selector to a CSS selector that appears only after the target content has rendered — this is more reliable than a fixed delay. js_wait_timeout is in milliseconds; 10 000 is a safe ceiling for most SPAs. See scraping JavaScript-rendered pages for detailed guidance on selector choice and session reuse.
var payload = new
{
    url = "https://spa-store.com/category/laptops",
    mode = "js_rendering",
    output_format = "html",
    js_wait_selector = ".product-card",
    js_wait_timeout = 10_000,
    proxy = "residential:us",
};

using var response = await http.PostAsJsonAsync("v1/scrape", payload, ct);
response.EnsureSuccessStatusCode();

var body = await response.Content
    .ReadFromJsonAsync<ScrapeResponse>(cancellationToken: ct);

// Log which rendering path was actually used
logger.LogInformation(
    "method_used={Method} solver_used={Solver}",
    body!.Metadata.MethodUsed,
    body.Metadata.SolverUsed);

// HTML content is in data.content
var html = body.Data.Content;
9. HTTP Status and Error Handling
Distinguish transport-level HTTP errors from application-level scrape failures. EnsureSuccessStatusCode throws HttpRequestException for 4xx/5xx responses, but you also need to check body.Success for cases where the API returns 200 with a failure payload (e.g., the target site was unreachable).
Use Polly (via Microsoft.Extensions.Http.Polly) for retry logic on transient errors. Do not retry on 401 or 402 — those require operator intervention, not automatic retries.
- 401 Unauthorized: API key missing or invalid; fix the Key Vault secret, do not retry automatically
- 402 Payment Required: account balance exhausted; pause the hosted service and alert ops
- 429 Too Many Requests: rate limit exceeded; apply Polly exponential backoff with jitter and respect the Retry-After header
- 502 Bad Gateway: transient upstream error; retry up to three times with a short delay
- body.Success == false with HTTP 200: the target URL was unreachable or returned an error; log and skip, do not feed into Polly
- TaskCanceledException from a client timeout: HttpClient reports an elapsed Timeout as TaskCanceledException, not HttpRequestException; increase client.Timeout for js_rendering requests, which take longer than fast HTTP fetches
- KeyNotFoundException on css_extracted: a selector stopped matching; alert and fall back to full HTML parsing
using System.Net;
using Polly;
using Polly.Extensions.Http;

// Polly retry policy registered at startup
// (chain this onto the AddHttpClient<OmniScrapeClient> call shown earlier)
builder.Services.AddHttpClient<OmniScrapeClient>()
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .OrResult(r => r.StatusCode == HttpStatusCode.TooManyRequests)
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: (attempt, outcome, _) =>
            {
                // Honour Retry-After if present, else exponential backoff
                if (outcome.Result?.Headers.RetryAfter?.Delta is { } delta)
                    return delta;
                return TimeSpan.FromSeconds(Math.Pow(2, attempt));
            },
            // The overload that passes the outcome to sleepDurationProvider
            // takes an async callback, so return a completed task
            onRetryAsync: (outcome, delay, attempt, _) =>
            {
                Log.Warning(
                    "Retry {Attempt} after {Delay}s - {Reason}",
                    attempt, delay.TotalSeconds,
                    outcome.Exception?.Message ?? outcome.Result?.StatusCode.ToString());
                return Task.CompletedTask;
            }));
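The triage rules listed above can also be written as a single switch expression, which is handy when a batch job has to decide what to do with a failed response outside the Polly pipeline. This is only a sketch: `FailureAction` and `Classify` are hypothetical names, and treating 408 and every other 5xx as retryable is an assumption chosen to match what Polly's HandleTransientHttpError covers.

```csharp
using System;
using System.Net;

// Hypothetical triage helper mirroring the status list above.
static FailureAction Classify(HttpStatusCode status) => status switch
{
    HttpStatusCode.Unauthorized => FailureAction.AlertOperator,       // 401: bad or missing key
    HttpStatusCode.PaymentRequired => FailureAction.AlertOperator,    // 402: balance exhausted
    HttpStatusCode.TooManyRequests => FailureAction.RetryWithBackoff, // 429: rate limited
    HttpStatusCode.RequestTimeout => FailureAction.RetryWithBackoff,  // 408 (assumption)
    >= HttpStatusCode.InternalServerError => FailureAction.RetryWithBackoff, // 5xx (assumption)
    _ => FailureAction.LogAndSkip,
};

Console.WriteLine(Classify(HttpStatusCode.BadGateway));      // RetryWithBackoff
Console.WriteLine(Classify(HttpStatusCode.PaymentRequired)); // AlertOperator
Console.WriteLine(Classify(HttpStatusCode.NotFound));        // LogAndSkip

// Hypothetical action set; extend it to suit your alerting setup.
enum FailureAction { RetryWithBackoff, AlertOperator, LogAndSkip }
```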
Frequently asked questions
AngleSharp or Html Agility Pack — which should I use?
AngleSharp for any greenfield .NET 6+ project. It implements the WHATWG HTML5 specification, supports CSS selectors natively, and integrates well with LINQ. Html Agility Pack is more tolerant of severely broken HTML and has a longer history in .NET, making it a reasonable choice when you're parsing legacy intranet pages or documents that predate modern HTML standards. For public web scraping, AngleSharp's spec compliance is an advantage — it parses pages the same way Chrome does.
Why not use Playwright or Puppeteer Sharp instead of OmniScrape?
Playwright is the right tool for authenticated workflows you control — filling forms, clicking through multi-step checkouts, or testing your own application. For scraping public bot-protected pages at scale, maintaining a headless browser fleet against adaptive bot vendors is a significant operational burden: fingerprint rotation, proxy management, CAPTCHA solving, and browser version updates all require ongoing work. OmniScrape handles that infrastructure. Use Playwright for flows that require session state you manage; use OmniScrape for public pages that block datacenter IPs.
Is Azure Functions a good host for a scraping workload?
Yes for scheduled jobs and event-driven triggers. Use the isolated worker model (not in-process) and register IHttpClientFactory via dependency injection in Program.cs. Set the function timeout in host.json above the worst-case js_rendering response time — allow at least 90 seconds. For high-volume continuous scraping, a Worker Service on Container Apps or AKS gives more control over concurrency and scaling than the Consumption plan.
Where should I store the OmniScrape API key in a .NET project?
In development: dotnet user-secrets set OmniScrape:ApiKey your-key-here. In CI/CD: environment variables injected at build time, not committed to source. In production on Azure: Azure Key Vault referenced via managed identity — no credentials in appsettings.json or Dockerfile. Never commit API keys to git; rotate immediately if one is exposed.
When should I use mode fast versus mode auto?
Default to auto in production. The auto mode tries fast HTTP-only fetching first and escalates to a headless browser only when it detects a challenge or JavaScript requirement. This keeps costs low for pages that don't need rendering while handling protected pages transparently. Use fast explicitly only when you have confirmed the target never requires JavaScript and you want to enforce the lower-cost path. Inspect metadata.method_used in API responses to understand what each domain actually requires.
How do I handle pagination across hundreds of pages efficiently?
Build the URL list upfront if the pagination pattern is predictable (e.g., ?page=1 through ?page=200), then run the SemaphoreSlim batch pattern. For sites with cursor-based or next-link pagination, chain requests sequentially: parse the next-page link from each response before queuing the next request. Use session_id in the OmniScrape request body to reuse a browser session across paginated requests on JavaScript-heavy sites — this avoids repeated challenge solving and reduces latency.
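For the predictable case, the URL list can be generated in one expression and handed to the SemaphoreSlim batch. A minimal sketch, assuming a `?page=N` query pattern; the base URL is a placeholder and `BuildPageUrls` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: builds ?page=first .. ?page=last for a listing URL.
static IReadOnlyList<string> BuildPageUrls(string baseUrl, int firstPage, int lastPage) =>
    Enumerable.Range(firstPage, lastPage - firstPage + 1)
        .Select(page => $"{baseUrl}?page={page}")
        .ToList();

var pageUrls = BuildPageUrls("https://example.com/products", 1, 200);
Console.WriteLine($"{pageUrls.Count} URLs, last: {pageUrls[pageUrls.Count - 1]}");
// prints: 200 URLs, last: https://example.com/products?page=200
```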
How do I deserialize OmniScrape responses with System.Text.Json?
Use JsonPropertyName attributes or configure JsonSerializerOptions with JsonNamingPolicy.SnakeCaseLower (.NET 8+) to map snake_case JSON fields to PascalCase C# properties. The response shape is: body.success (bool), body.data.content (HTML string), body.data.css_extracted (Dictionary<string,string> when using css_extractor), and body.metadata.method_used (string). The correct field for HTML content is data.content — there is no data.html field in the OmniScrape response.
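A short sketch of the naming-policy route, assuming .NET 8 or later. The JSON below is a made-up payload shaped like the fields described above, not a captured OmniScrape response, and the records repeat the shapes used earlier in this guide without any attributes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// JsonNamingPolicy.SnakeCaseLower requires .NET 8+.
var options = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.SnakeCaseLower,
};

// Made-up sample payload following the documented response shape.
const string json = """
{
  "success": true,
  "data": { "content": "<html></html>", "css_extracted": { "title": "Example" } },
  "metadata": { "method_used": "fast", "solver_used": false, "challenge_solved": false },
  "error": null
}
""";

var body = JsonSerializer.Deserialize<ScrapeResponse>(json, options)
    ?? throw new InvalidOperationException("Null response body");

Console.WriteLine(body.Metadata.MethodUsed);         // fast
Console.WriteLine(body.Data.CssExtracted?["title"]); // Example

public record ScrapeResponse(bool Success, ScrapeData Data, ScrapeMetadata Metadata, string? Error);
public record ScrapeData(string Content, Dictionary<string, string>? CssExtracted);
public record ScrapeMetadata(string MethodUsed, bool SolverUsed, bool ChallengeSolved);
```

The same options instance can be passed to ReadFromJsonAsync<T>(options, ct) so that HTTP responses use the identical mapping. Dictionary keys such as "title" are not renamed by the policy during deserialization.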