Web Scraping with Go (Golang)

1. Module setup and dependencies

Initialise a module, then pull in `goquery`. It transitively brings in `golang.org/x/net/html`, which is the actual HTML5 parser. `resty` is optional — it reduces JSON-API boilerplate but adds a dependency. Decide once per project and stay consistent.

Pin versions in `go.mod` and commit it together with `go.sum`, which records the checksums of every module in the build. Resolving `@latest` in CI causes silent breakage when upstream ships a behaviour-changing release. Run `go mod tidy` after every dependency change to keep the graph clean.

terminal

go mod init example.com/scraper
go get github.com/PuerkitoBio/goquery
# optional HTTP client sugar:
go get github.com/go-resty/resty/v2

2. Fetch pages with net/http

Always thread `context.Context` through every HTTP call. A scraper without cancellation leaks goroutines when a slow target stalls — the goroutine blocks on `resp.Body.Read` indefinitely. `context.WithTimeout` is the cheapest insurance you have.

Set a realistic `User-Agent`. Many CDNs reject requests carrying Go's default `Go-http-client/1.1` string (`Go-http-client/2.0` over HTTP/2) before even evaluating the page. A plausible browser UA does not defeat bot detection on its own, but it avoids trivial rejections on lightly-protected sites.

Read the full body with `io.ReadAll` and close the response body in a `defer`. Failing to drain and close the body prevents connection reuse in `http.DefaultClient`'s transport pool, which degrades throughput on high-volume crawls.

fetch.go

package main

import (
    "context"
    "io"
    "log"
    "net/http"
    "time"
)

func fetchPage(ctx context.Context, target string) ([]byte, error) {
    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "GET", target, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent",
        "Mozilla/5.0 (compatible; GoScraper/1.0)")
    req.Header.Set("Accept-Language", "en-US,en;q=0.9")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    log.Printf("GET %s -> %d (%d bytes)", target, resp.StatusCode, len(body))
    return body, nil
}

3. Parse HTML with goquery

`goquery.NewDocumentFromReader` accepts any `io.Reader`. Wrap the raw response bytes in `bytes.NewReader` and you have a parsed document ready for CSS selectors. `Find()` returns a `*Selection`; call `Each()` to iterate matched nodes.

`Text()` returns the concatenated text content of the node and all descendants — trim whitespace with `strings.TrimSpace`. `Attr()` returns the attribute value and a boolean indicating presence; always check the boolean when the attribute is optional.

Avoid deeply nested `Find` chains inside `Each` callbacks. Prefer a flat selector that targets the leaf element directly — it is both faster and easier to maintain when the site updates its markup.

parse.go

package main

import (
    "bytes"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

type Book struct {
    Title string
    Price string
    URL   string
}

func parseBooks(body []byte) []Book {
    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }

    var books []Book
    doc.Find("article.product_pod").Each(func(_ int, s *goquery.Selection) {
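        link := s.Find("h3 a")
        title, ok := link.Attr("title")
        if !ok {
            // Fall back to the visible link text when the title attribute is missing.
            title = strings.TrimSpace(link.Text())
        }
        price := strings.TrimSpace(s.Find(".price_color").Text())
        href, ok := link.Attr("href")
        if !ok {
            return // no link to follow; skip this entry
        }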
        books = append(books, Book{
            Title: title,
            Price: price,
            URL:   "https://books.toscrape.com/catalogue/" + href,
        })
    })

    log.Printf("parsed %d books", len(books))
    return books
}
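
`parseBooks` builds each URL by prepending a fixed prefix to `href`, which breaks as soon as a link is absolute or starts with `../`. A minimal sketch of the more robust alternative, resolving the link against the URL of the page it came from with the standard `net/url` package, follows. The helper name `resolveHref` is an assumption for illustration.

```go
package main

import (
    "fmt"
    "net/url"
)

// resolveHref resolves href against the URL of the page it was found on,
// following the usual relative-reference rules (RFC 3986).
func resolveHref(pageURL, href string) (string, error) {
    base, err := url.Parse(pageURL)
    if err != nil {
        return "", fmt.Errorf("parse page URL: %w", err)
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", fmt.Errorf("parse href: %w", err)
    }
    return base.ResolveReference(ref).String(), nil
}
```

For example, resolving `a-light-in-the-attic_1000/index.html` against `https://books.toscrape.com/catalogue/page-2.html` yields `https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html`, and an `href` that is already absolute is returned unchanged.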

4. Concurrent scraping with a goroutine worker pool

Fan out requests across a fixed number of goroutines using buffered channels. The `jobs` channel carries URLs; the `results` channel carries structured output. A `sync.WaitGroup` is not needed here because the result count equals the job count — reading `len(urls)` results from the channel is sufficient synchronisation.

Cap workers at 5–10 for external targets unless you have confirmed rate-limit headroom. More goroutines do not mean more throughput when the bottleneck is the remote server or your egress bandwidth. For OmniScrape calls the practical limit is your account's concurrent-request allowance.

Pass the API key via environment variable — never hardcode credentials. `os.Getenv` at startup, fail fast if empty.

pool.go

package main

import (
    "context"
    "log"
    "os"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

type Result struct {
    URL  string
    HTML string
    Err  error
}

func worker(
    ctx context.Context,
    jobs <-chan string,
    results chan<- Result,
    apiKey string,
) {
    for url := range jobs {
        html, err := fetchOmniScrape(ctx, apiKey, url)
        results <- Result{URL: url, HTML: html, Err: err}
    }
}

func main() {
    apiKey := os.Getenv("OMNISCRAPE_KEY")
    if apiKey == "" {
        log.Fatal("OMNISCRAPE_KEY not set")
    }

    urls := []string{
        "https://example.com/product/1",
        "https://example.com/product/2",
        "https://example.com/product/3",
    }

    jobs := make(chan string, len(urls))
    results := make(chan Result, len(urls))

    ctx := context.Background()
    const workers = 5
    for w := 0; w < workers; w++ {
        go worker(ctx, jobs, results, apiKey)
    }

    for _, u := range urls {
        jobs <- u
    }
    close(jobs)

    for range urls {
        r := <-results
        if r.Err != nil {
            log.Printf("FAIL %s: %v", r.URL, r.Err)
            continue
        }
        doc, err := goquery.NewDocumentFromReader(strings.NewReader(r.HTML))
        if err != nil {
            log.Printf("FAIL %s: parse: %v", r.URL, err)
            continue
        }
        log.Printf("OK   %s | h1: %q", r.URL, strings.TrimSpace(doc.Find("h1").First().Text()))
    }
}

5. OmniScrape API integration with net/http

Marshal a request struct to JSON, POST it to `https://api.omniscrape.io/v1/scrape` with your `X-API-Key` header, then decode the response. On bot-protected retail, news, or travel sites this replaces your direct `GET` entirely — the API handles TLS fingerprinting, browser emulation, and CAPTCHA solving upstream.

The response HTML lives in `data.content`. Check `success` before accessing `data`: the API can report a failure as a structured error body with HTTP 200, so a nil decode error does not imply success. Log `metadata.method_used` to understand whether the API escalated to a headless browser; that affects your billing and latency expectations.

For deeper context on what happens when the API encounters a Cloudflare challenge, see Cloudflare bypass.

omniscrape.go

package main

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

type ScrapeRequest struct {
    URL          string `json:"url"`
    Mode         string `json:"mode"`
    OutputFormat string `json:"output_format"`
    EnableSolver bool   `json:"enable_solver,omitempty"`
}

type ScrapeResponse struct {
    Success bool `json:"success"`
    Data    struct {
        Content string `json:"content"`
    } `json:"data"`
    Metadata struct {
        MethodUsed      string `json:"method_used"`
        SolverUsed      bool   `json:"solver_used"`
        ChallengeSolved bool   `json:"challenge_solved"`
    } `json:"metadata"`
    Billing struct {
        Charged      float64 `json:"charged"`
        BalanceAfter float64 `json:"balance_after"`
    } `json:"billing"`
    Error string `json:"error,omitempty"`
}

func fetchOmniScrape(ctx context.Context, apiKey, target string) (string, error) {
    payload, err := json.Marshal(ScrapeRequest{
        URL:          target,
        Mode:         "auto",
        OutputFormat: "html",
        EnableSolver: true,
    })
    if err != nil {
        return "", err
    }

    req, err := http.NewRequestWithContext(ctx, "POST",
        "https://api.omniscrape.io/v1/scrape", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("X-API-Key", apiKey)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var out ScrapeResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return "", fmt.Errorf("decode error: %w", err)
    }
    if !out.Success {
        return "", fmt.Errorf("scrape failed for %s: %s", target, out.Error)
    }

    // HTML is in out.Data.Content
    return out.Data.Content, nil
}

6. OmniScrape call with resty

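resty reduces JSON API boilerplate: set the base URL, default headers, and timeout once on the client, then reuse it across all requests. `SetResult` unmarshals a 2xx JSON response body directly into your struct, so no manual `json.NewDecoder` call is needed. For non-2xx responses resty leaves that struct untouched unless you also register an error type with `SetError`.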

The `css_extractor` output format lets the API extract structured data server-side, returning a `css_extracted` map instead of raw HTML. This is more efficient than fetching full HTML and parsing locally when you only need a handful of fields.

resty.go

package main

import (
    "fmt"
    "log"
    "os"
    "time"

    "github.com/go-resty/resty/v2"
)

// restyClient is configured once and reused, so connections are pooled across calls.
var restyClient = resty.New().
    SetBaseURL("https://api.omniscrape.io").
    SetHeader("X-API-Key", os.Getenv("OMNISCRAPE_KEY")).
    SetTimeout(2 * time.Minute)

func fetchWithResty(target string) (*ScrapeResponse, error) {
    var result ScrapeResponse
    _, err := restyClient.R().
        SetBody(map[string]any{
            "url":           target,
            "mode":          "auto",
            "output_format": "css_extractor",
            "enable_solver": true,
            "proxy":         "residential:us",
            "css_selectors": map[string]string{
                "title":       "h1",
                "price":       "[data-price]",
                "description": ".product-description p",
            },
        }).
        SetResult(&result).
        Post("/v1/scrape")
    if err != nil {
        return nil, err
    }
    if !result.Success {
        return nil, fmt.Errorf("scrape failed: %s", result.Error)
    }
    log.Printf("method_used=%s charged=%.4f",
        result.Metadata.MethodUsed, result.Billing.Charged)
    return &result, nil
}

7. Colly for link discovery, OmniScrape for protected fetches

Colly is well-suited for crawling link graphs on open sites — its `OnHTML` callbacks, politeness delays, and revisit tracking save real implementation time. Use it to discover product, article, or listing URLs from sitemaps and paginated indexes.

On protected detail pages, do not expect Colly's default HTTP client to survive Akamai, PerimeterX, or Cloudflare Bot Management. The fingerprint is wrong at the TLS layer before any application-level header is evaluated. The right pattern: use Colly to collect URLs, then feed those URLs to `fetchOmniScrape` — either directly or via the worker pool above.

You can hook OmniScrape into Colly's custom transport by implementing `http.RoundTripper`, but it is simpler to keep the two concerns separate: Colly owns crawl state and URL deduplication; OmniScrape owns the actual fetch for protected targets.

8. JavaScript-rendered pages with js_rendering mode

goquery operates on the raw HTML returned by the server. Single-page applications that render content client-side via React, Vue, or similar frameworks return a near-empty HTML shell — goquery will find nothing useful. For these targets, use `mode: js_rendering` which runs a headless browser upstream.

Pair `js_rendering` with `js_wait_selector` to tell the browser to wait until a specific element is present in the DOM before capturing the page. Without it, the snapshot may be taken before the async data fetch completes. `js_wait_timeout` sets the maximum wait in milliseconds before the API gives up and returns whatever is rendered so far.

Full walkthrough with pagination and infinite scroll: scraping JavaScript-rendered pages.

js_rendering.go

package main

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

func fetchSPA(ctx context.Context, target string) (string, error) {
    payload, err := json.Marshal(map[string]any{
        "url":              target,
        "mode":             "js_rendering",
        "output_format":    "html",
        "js_wait_selector": ".product-card",
        "js_wait_timeout":  10000,
        "enable_solver":    true,
    })
    if err != nil {
        return "", err
    }

    req, err := http.NewRequestWithContext(ctx, "POST",
        "https://api.omniscrape.io/v1/scrape", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("X-API-Key", os.Getenv("OMNISCRAPE_KEY"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var out ScrapeResponse
    if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
        return "", fmt.Errorf("decode error: %w", err)
    }
    if !out.Success {
        return "", fmt.Errorf("js_rendering failed: %s", out.Error)
    }
    // HTML content is in out.Data.Content
    return out.Data.Content, nil
}

9. Error handling, retries, and observability

A production scraper needs more than a single retry loop. Structure your error handling around the response characteristics so you spend credits only on requests that have a realistic chance of succeeding on retry.

Key rules for OmniScrape error handling:

Propagate context cancellation immediately: if the parent context is done, stop retrying and return the context error
Retry HTTP 502/503/504 with exponential backoff and ±20% jitter; cap at 3 attempts
Never retry 401 (bad API key) or 402 (insufficient credits); these require operator intervention
Check for `success: false` in the response body even on HTTP 200, because the API returns structured errors this way
Log `metadata.method_used` and `billing.charged` per request; aggregate in your metrics system to track cost per domain
Route `success: false` URLs to a dead-letter file or queue for offline inspection and manual replay
Set a custom `http.Transport` with `MaxIdleConnsPerHost` tuned to your worker count to avoid connection exhaustion on high-concurrency runs

Frequently asked questions

Should I use net/http or resty for OmniScrape calls?

net/http keeps your dependency graph minimal and is the right choice for dedicated scraper binaries or CLI tools. resty is ergonomic when you are already using it elsewhere in the codebase and want to avoid manual JSON encode/decode boilerplate. Both work identically against the OmniScrape API — the choice is a style preference, not a correctness issue.

Should I use Colly for everything?

Use Colly for crawling open sites where you need link graph traversal, politeness delays, and URL deduplication out of the box. For protected pages — anything behind Cloudflare, Akamai, or PerimeterX — route fetches through OmniScrape regardless of which crawl framework you use. Colly's HTTP client cannot survive modern bot management at the TLS fingerprint level.

How do I limit memory when scraping large pages?

If you only need structured fields, use output_format: css_extractor instead of html. The API extracts data server-side and returns a small JSON map — you never allocate the full HTML string in your process. For HTML output, avoid accumulating all pages in a slice; process and discard each result as it arrives from the results channel.

When should I use http.DefaultClient vs a custom Transport?

http.DefaultClient is fine for low-concurrency scrapers. When running 20+ concurrent OmniScrape calls from one process, create a custom http.Transport with MaxIdleConnsPerHost set to your worker count. Without this, the default limit of 2 idle connections per host causes excessive TCP connection churn and adds measurable latency.

What is the difference between mode auto and mode fast?

mode auto is the default and preferred choice — it tries a lightweight HTTP fetch first and escalates to a headless browser only if the response indicates a challenge or empty render. mode fast skips the escalation logic entirely and returns whatever the HTTP response contains. Use fast only when you have confirmed the target is static and unprotected, and you want to minimise latency and cost.

How do I handle pagination in a Go scraper?

For simple numeric pagination, generate URLs in a loop and push them into the jobs channel. For cursor- or token-based pagination, process each result synchronously — extract the next-page token from the HTML using goquery, then enqueue the next URL. Avoid recursion; use an explicit queue (a slice or channel) to track pending pages and a visited map to prevent cycles.
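
The explicit-queue approach might look like the sketch below. The `nextFunc` hook is hypothetical: in a real scraper it would fetch the page and pull the next-page links out with goquery, while here it is injected so the traversal logic stands on its own. The `maxPages` cap is an added safety assumption.

```go
package main

// nextFunc fetches one page and returns the URLs of the pages it links to
// as "next" (hypothetical hook; real code would fetch and parse here).
type nextFunc func(pageURL string) ([]string, error)

// walkPages visits pages breadth-first from start, never visits a URL
// twice, and stops after maxPages. It returns URLs in visit order.
func walkPages(start string, maxPages int, next nextFunc) ([]string, error) {
    queue := []string{start}
    visited := map[string]bool{start: true}
    var order []string

    for len(queue) > 0 && len(order) < maxPages {
        page := queue[0]
        queue = queue[1:]
        order = append(order, page)

        links, err := next(page)
        if err != nil {
            return order, err
        }
        for _, l := range links {
            if !visited[l] {
                visited[l] = true // mark on enqueue so duplicates never queue
                queue = append(queue, l)
            }
        }
    }
    return order, nil
}
```

Because the visited map is checked before enqueueing, a last page that links back to the first one ends the walk instead of looping.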

Can I use session_id to maintain state across requests?

Yes. Pass a session_id string in the request body. The OmniScrape API will reuse the same browser context for subsequent requests with the same session ID, preserving cookies and local storage. This is useful for sites that require a login flow before reaching the target page. Generate a unique session ID per scrape job, not per request.
