OmniScrape

Reddit Scraper: Posts, Comments, and Subreddit Data

Reddit is one of the richest public sources of unfiltered opinion — product feedback, employer reviews, niche market signals, and early trend detection all live in its threads. For developers building sentiment pipelines or brand monitors, the challenge is not finding the data but fetching it reliably under Reddit's rate limits and evolving access policies.

Since the 2023 API pricing overhaul ended free commercial access, most teams have shifted to two approaches: the public `.json` suffix on thread URLs (no OAuth required for read-only public data at moderate volume) and HTML scraping of `old.reddit.com`, which serves a stable server-rendered page that does not require JavaScript. This guide covers both paths — URL patterns, HTML selectors, pagination mechanics, and complete OmniScrape request examples. For broader social media archiving patterns, see social media web scraping. For downstream processing, see sentiment analysis web scraping.

On this page

1. Reddit data fields social listening teams extract
2. Reddit URL patterns for scraping
3. The .json suffix: structured data without HTML parsing
4. old.reddit HTML structure and CSS selectors
5. Rate limits, access controls, and bot detection
6. Fetch a Reddit thread as JSON with OmniScrape
7. Scrape old.reddit HTML with CSS extraction
8. Comment pagination and large thread handling
9. Reddit API terms, researcher access, and data handling
10. FAQ

1. Reddit data fields social listening teams extract

The fields worth collecting depend on your use case. Brand monitors care about mention volume, sentiment signals in post body and comments, and the score as a proxy for community endorsement. Researchers archiving threads for longitudinal studies need stable identifiers, timestamps, and the full comment tree with parent references so they can reconstruct conversation structure offline.

Below is a practical field inventory covering posts, comments, and subreddit-level metadata. Not all fields are available on every surface — the `.json` endpoint exposes the most complete set, while HTML scraping gives you the visible subset rendered on the page.

  • Post id, title, selftext (body) or external link URL, author handle
  • Score (net upvotes), upvote_ratio, num_comments
  • created_utc timestamp and post flair text/color
  • Subreddit name, subscriber count (subscribers), description, rules summary
  • Comments: body, author, score, depth level, parent_id, created_utc
  • Awards and gilding counts when visible in JSON (gilded, all_awardings)
  • over_18 (NSFW) flag and quarantine flag at subreddit and post level
  • Crosspost parent list (crosspost_parent_list) for tracking reposts
  • Distinguished field (moderator/admin labels) and stickied flag
  • Removal reason or removed_by_category for moderation research
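The inventory above can be pinned down as a typed record. The following is a minimal sketch, not a complete schema: `PostRecord` and `post_from_json` are hypothetical names, the attribute names mirror the JSON keys listed above, and only a subset of the fields is shown.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PostRecord:
    """Subset of post-level fields; attribute names mirror Reddit's JSON keys."""
    id: str
    title: str
    selftext: str
    author: str
    subreddit: str
    score: int
    upvote_ratio: Optional[float]
    num_comments: int
    created_utc: float
    over_18: bool
    stickied: bool
    removed_by_category: Optional[str]


def post_from_json(data: dict[str, Any]) -> PostRecord:
    """Build a PostRecord from the `data` dict of a post object.

    Optional fields fall back to neutral defaults because not every
    surface exposes every field.
    """
    return PostRecord(
        id=data["id"],
        title=data.get("title", ""),
        selftext=data.get("selftext", ""),
        author=data.get("author", "[deleted]"),
        subreddit=data.get("subreddit", ""),
        score=int(data.get("score", 0)),
        upvote_ratio=data.get("upvote_ratio"),
        num_comments=int(data.get("num_comments", 0)),
        created_utc=float(data.get("created_utc", 0.0)),
        over_18=bool(data.get("over_18", False)),
        stickied=bool(data.get("stickied", False)),
        removed_by_category=data.get("removed_by_category"),
    )
```

Keeping the record frozen makes it safe to use as a deduplication key or to pass between pipeline stages without accidental mutation.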

2. Reddit URL patterns for scraping

Reddit's URL structure is consistent and predictable, which makes it straightforward to construct target URLs programmatically. The `.json` suffix works on most public listing and thread URLs: append it to the end of the URL path, before any query string. `old.reddit.com` mirrors the same path structure as `www.reddit.com` but serves static HTML instead of the React SPA.

When building a crawler, generate thread URLs from subreddit listing pages (e.g., `/r/python/new.json?limit=100`) and then fetch individual thread JSON for full comment data. Subreddit listing endpoints support `after` and `before` pagination cursors, passed as query parameters (for example, `after=t3_postid`).

  • Post page: https://www.reddit.com/r/python/comments/abc123/title_slug/
  • Post JSON: https://www.reddit.com/r/python/comments/abc123/title_slug.json
  • Subreddit hot listing: https://www.reddit.com/r/MachineLearning/hot.json?limit=100
  • Subreddit new listing: https://www.reddit.com/r/MachineLearning/new.json?limit=100
  • Subreddit top (all time): https://www.reddit.com/r/MachineLearning/top.json?t=all&limit=100
  • Pagination cursor: append &after=t3_postid to any listing URL
  • old.reddit thread: https://old.reddit.com/r/python/comments/abc123/
  • old.reddit subreddit: https://old.reddit.com/r/datascience/new/
  • User profile (public): https://www.reddit.com/user/username/submitted.json
  • Search results: https://www.reddit.com/search.json?q=keyword&sort=new&limit=100
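The patterns above are regular enough to generate in code. As a rough sketch using only the standard library (the helper names `listing_url` and `thread_json_url` are hypothetical):

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit, urlunsplit

BASE = "https://www.reddit.com"


def listing_url(subreddit: str, sort: str = "new", limit: int = 100,
                after: Optional[str] = None, t: Optional[str] = None) -> str:
    """Build a subreddit listing URL such as /r/<sub>/top.json?t=all&limit=100."""
    params: dict[str, object] = {}
    if t is not None:
        params["t"] = t  # time window, used with the `top` sort
    params["limit"] = limit
    if after:
        params["after"] = after  # pagination cursor, e.g. t3_postid
    return f"{BASE}/r/{subreddit}/{sort}.json?{urlencode(params)}"


def thread_json_url(thread_url: str) -> str:
    """Append .json to the path of a thread URL, keeping any query string."""
    parts = urlsplit(thread_url)
    path = parts.path.rstrip("/") + ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For example, `listing_url("MachineLearning", "top", t="all")` reproduces the "top (all time)" URL from the list above.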

3. The .json suffix: structured data without HTML parsing

Appending `.json` to a Reddit thread URL returns a two-element array: `[0]` is the post listing (one item), and `[1]` is the comment listing with a nested `children` array. Each comment is a `t1` kind object; the special `more` kind signals additional comments not yet loaded. This structure lets you reconstruct the full comment tree using `parent_id` references without any HTML parsing.

The post object lives at `[0].data.children[0].data`, where fields like `title`, `selftext`, `score`, `upvote_ratio`, `num_comments`, `created_utc`, and `author` are all present. Comments are at `[1].data.children`, each with `data.body`, `data.author`, `data.score`, `data.depth`, and `data.parent_id`.
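To make the two-element shape concrete, here is a small sketch that pulls out the post and flattens the comment tree. It assumes nested replies sit under each comment's `data.replies` as another listing (an empty string when there are none), which matches the structure Reddit commonly returns but is not spelled out above. The function names are illustrative.

```python
from typing import Any, Iterator


def iter_comments(children: list[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield flat comment records depth-first, skipping `more` placeholders."""
    for child in children:
        if child.get("kind") != "t1":
            continue  # `more` objects are handled separately
        data = child["data"]
        yield {
            "id": data["id"],
            "parent_id": data.get("parent_id"),
            "author": data.get("author"),
            "body": data.get("body"),
            "score": data.get("score"),
            "depth": data.get("depth"),
        }
        replies = data.get("replies")
        if isinstance(replies, dict):  # assumed: "" when a comment has no replies
            yield from iter_comments(replies["data"]["children"])


def parse_thread(payload: list[dict[str, Any]]) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split a thread response into (post data, flat list of comments)."""
    post = payload[0]["data"]["children"][0]["data"]
    comments = list(iter_comments(payload[1]["data"]["children"]))
    return post, comments
```

Because each flat record keeps `parent_id`, the tree can be rebuilt later by grouping records on that key.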

Rate limits are enforced per IP. Reddit expects a descriptive `User-Agent` header identifying your application; the format from their API wiki is `platform:appname:version (by /u/yourusername)`. Without a proper User-Agent, requests are more aggressively rate-limited. Deleted and removed comments stay in the tree with placeholder values: `[deleted]` as author, and `[deleted]` (deleted by the user) or `[removed]` (removed by moderators) as body. Preserve these markers in your schema rather than dropping the records, since the comment's position in the tree still carries structural meaning.

For subreddit listing endpoints, the response is a single listing object: `data.children` is an array of post objects, and `data.after` is the pagination cursor for the next page. Set `limit=100` (Reddit's maximum per request) to minimize the number of round trips needed to cover a subreddit's history.
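The cursor loop this describes can be sketched independently of any HTTP client. In the example below, `fetch_page` is a hypothetical callable you supply: it takes the current `after` cursor (or `None` for the first page) and returns the decoded listing object.

```python
from typing import Any, Callable, Iterator, Optional


def walk_listing(
    fetch_page: Callable[[Optional[str]], dict[str, Any]],
    max_pages: int = 10,
) -> Iterator[dict[str, Any]]:
    """Yield post `data` dicts page by page until `data.after` is empty."""
    after: Optional[str] = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        for child in listing["data"]["children"]:
            yield child["data"]
        after = listing["data"].get("after")
        if not after:
            break  # no further pages
```

The `max_pages` guard is a design choice for the sketch: it bounds the number of round trips even if a cursor keeps coming back.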

4. old.reddit HTML structure and CSS selectors

`old.reddit.com` uses a stable, server-rendered HTML structure that has changed minimally over many years. Each post on a listing page is a `div.thing` element with data attributes (`data-fullname`, `data-score`, `data-comments-count`, `data-subreddit`) that are often more reliable than parsing visible text. On a thread page, the post itself is `div.thing.link` and comments are `div.thing.comment` nested inside `div.child` containers.

Key selectors for post listings: `a.title` for the post title and link, `time.live-timestamp` (with `datetime` attribute) for the ISO timestamp, `div.score.unvoted` for the score, `a.author` for the username, `a.subreddit` for the subreddit name, and `a.comments` for the comment count and thread link.

For thread pages, the post body is inside `div.usertext-body > div.md`. Comment bodies are `div.thing.comment div.usertext-body > div.md` — but because comments nest arbitrarily deep, a flat CSS selector will return all comments at all depths. If you need depth information, parse the nesting level from the `div.child` hierarchy or read the `data-depth` attribute on `div.thing.comment`.
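If you parse the HTML yourself rather than using CSS extraction, the data attributes on `div.thing` can be read with the standard library alone. This is a sketch under the assumption that the attributes appear as described above; `ThingParser` and `extract_things` are hypothetical names, and a library such as Beautiful Soup would work just as well.

```python
from html.parser import HTMLParser
from typing import Optional


class ThingParser(HTMLParser):
    """Collect data attributes from every <div class="thing ..."> element."""

    def __init__(self) -> None:
        super().__init__()
        self.things: list[dict[str, Optional[str]]] = []

    def handle_starttag(self, tag: str, attrs: list[tuple[str, Optional[str]]]) -> None:
        if tag != "div":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "thing" not in classes:
            return
        kind = "comment" if "comment" in classes else "link" if "link" in classes else None
        self.things.append({
            "kind": kind,
            "fullname": attr.get("data-fullname"),
            "score": attr.get("data-score"),
            "comments_count": attr.get("data-comments-count"),
            "subreddit": attr.get("data-subreddit"),
        })


def extract_things(html: str) -> list[dict[str, Optional[str]]]:
    """Return one dict per div.thing found in the HTML, in document order."""
    parser = ThingParser()
    parser.feed(html)
    parser.close()
    return parser.things
```

Values are returned as strings exactly as they appear in the markup; convert `score` and `comments_count` to integers downstream once you have confirmed they are present.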

New Reddit (`www.reddit.com`) renders comment threads client-side via React. Fetching `www.reddit.com` thread pages without JavaScript execution returns a nearly empty shell. Always prefer `old.reddit.com` or the `.json` endpoint for scraping, and only fall back to `js_rendering` mode if you specifically need new Reddit's UI or features not available elsewhere.

5. Rate limits, access controls, and bot detection

Reddit does not sit behind Cloudflare for most public pages, but it runs its own rate limiting and bot detection. The primary signals Reddit uses are request rate per IP, missing or generic User-Agent strings, and behavioral patterns like fetching sequential post IDs without any variation. Datacenter IPs are more aggressively limited than residential addresses.

The 2023 API changes introduced paid tiers for the official API. The public `.json` endpoints remain accessible without OAuth for read-only access to public subreddits, but Reddit has tightened enforcement. NSFW and quarantined subreddits require a logged-in session with age verification — you cannot scrape these with anonymous requests. Private subreddits require membership and cannot be accessed without authentication.

For high-volume collection, residential proxy rotation is the most effective mitigation for IP-based rate limits. Spread requests across subreddits and time rather than hammering a single endpoint. If you are building a production pipeline, consider the official Data API for large-scale commercial use — it is the only path that is clearly within Reddit's terms for commercial products.

  • 429 Too Many Requests from rapid unauthenticated requests — back off and rotate IPs
  • OAuth required for official API; free tier removed for commercial use in 2023
  • NSFW and quarantined subreddits require authenticated session with age confirmation
  • new.reddit thread pages require JavaScript rendering for full comment load
  • Shadowbanned users' posts appear deleted in HTML; their JSON entries show removed status
  • Missing or bot-like User-Agent strings trigger faster rate limiting
  • Sequential ID crawling patterns are more likely to trigger blocks than organic browsing patterns
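For the 429 case, a capped exponential backoff with jitter is a common pattern. The sketch below is one way to do it, not a Reddit-specified policy: the base delay, cap, and retry count are assumed values, and `fetch` is a hypothetical callable returning a `(status, body)` pair.

```python
import random
import time
from typing import Callable, Iterator


def backoff_delays(max_retries: int = 5, base: float = 2.0, cap: float = 60.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one sleep duration per retry: exponential, capped, with jitter.

    Each delay falls between half of the current ceiling and the full ceiling.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        half = ceiling / 2
        yield half + half * rng()


def fetch_with_backoff(fetch: Callable[[], tuple[int, str]],
                       sleep: Callable[[float], None] = time.sleep,
                       max_retries: int = 5,
                       rng: Callable[[], float] = random.random) -> tuple[int, str]:
    """Call `fetch` until it returns something other than 429 or retries run out."""
    for delay in backoff_delays(max_retries, rng=rng):
        status, body = fetch()
        if status != 429:
            return status, body
        sleep(delay)
    return fetch()  # final attempt; the caller decides what to do with a 429
```

Injecting `sleep` and `rng` keeps the retry logic testable without real waiting; in production you would leave both at their defaults and rotate the proxy between attempts.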

6. Fetch a Reddit thread as JSON with OmniScrape

Use OmniScrape to fetch the `.json` URL for a Reddit thread. The response arrives in `body.data.content` as a JSON string, so parse it in your worker to access the post and comment arrays. `mode: "auto"` is sufficient for most Reddit JSON endpoints because they return complete responses over plain HTTP with no JavaScript execution; OmniScrape will use the fast HTTP lane and only escalate to a browser if needed.

Use a residential US proxy to reduce the chance of hitting IP-based rate limits. If you need to pass a custom `User-Agent` header to satisfy Reddit's API rules, add it via the `custom_headers` field.

Fetch Reddit thread JSON

```json
{
  "url": "https://www.reddit.com/r/webscraping/comments/1a2b3c4/how_do_you_handle_rate_limits.json",
  "mode": "auto",
  "output_format": "html",
  "proxy": "residential:us",
  "custom_headers": {
    "User-Agent": "web:omniscrape-reddit-example:1.0 (by /u/your_reddit_username)"
  }
}
```
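On the worker side, the request above and the `body.data.content` parsing step might look like the sketch below. It deliberately leaves out the HTTP call, since the endpoint and authentication are covered in the API reference. It also assumes `response_body` is the already-decoded response body, so that `body.data.content` maps to `response_body["data"]["content"]`; both function names are hypothetical.

```python
import json
from typing import Any


def build_thread_request(thread_json_url: str, user_agent: str) -> dict[str, Any]:
    """Assemble the request payload shown above for a given thread .json URL."""
    return {
        "url": thread_json_url,
        "mode": "auto",
        "output_format": "html",
        "proxy": "residential:us",
        "custom_headers": {"User-Agent": user_agent},
    }


def parse_thread_content(response_body: dict[str, Any]) -> list[dict[str, Any]]:
    """Decode the JSON string at data.content into the [post, comments] array.

    Assumption: `response_body` is the decoded response body, with the
    Reddit payload stored as a string under data.content.
    """
    content = response_body["data"]["content"]
    payload = json.loads(content)
    if not isinstance(payload, list) or len(payload) != 2:
        raise ValueError("expected a two-element thread array")
    return payload
```

The shape check fails fast when Reddit returns something other than a thread, such as an error object or a login redirect page.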

7. Scrape old.reddit HTML with CSS extraction

When you need specific visible fields from a thread without parsing the full JSON tree, CSS extraction against `old.reddit.com` is efficient. OmniScrape runs the selectors server-side and returns only the matched values in `body.data.css_extracted` — no need to download and parse the full HTML in your worker.

The selectors below target the post-level fields on a thread page. For comment bodies, add a multi-match selector for `div.thing.comment div.usertext-body > div.md` — the response will include an array of all matched elements. Note that `div.score.unvoted` may render as `div.score.likes` or `div.score.dislikes` depending on vote state; `div.score` alone is a safer selector if you do not need vote direction.

old.reddit CSS extraction request

```json
{
  "url": "https://old.reddit.com/r/datascience/comments/xyz789/weekly_thread/",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "custom_headers": {
    "User-Agent": "web:omniscrape-reddit-example:1.0 (by /u/your_reddit_username)"
  },
  "css_selectors": {
    "title": "a.title",
    "score": "div.score",
    "author": "a.author",
    "subreddit": "a.subreddit",
    "comment_count": "a.comments",
    "post_body": "div.usertext-body > div.md",
    "timestamp": "time.live-timestamp",
    "flair": "span.flair"
  }
}
```
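Extracted values for `score` and `comment_count` come back as visible text rather than numbers. A small normalizer helps before storage. This sketch assumes text along the lines of `123 comments` or `1,234`, and treats anything without digits (for example a hidden score) as missing; `parse_count` is a hypothetical helper.

```python
import re
from typing import Optional

# An optional minus sign, then digits with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*")


def parse_count(text: Optional[str]) -> Optional[int]:
    """Return the first integer found in the text, or None if there is none."""
    if not text:
        return None
    match = _NUMBER.search(text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))
```

Returning `None` instead of `0` keeps "no number shown" distinguishable from a real zero in downstream analysis.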

8. Comment pagination and large thread handling

Reddit truncates comment trees in the `.json` response for large threads. By default the response includes roughly 200 comments and collapses the rest, including deep subtrees, into `more` objects. Each `more` object contains a list of child IDs that must be fetched separately via the `morechildren` API endpoint (`https://www.reddit.com/api/morechildren.json`). Without OAuth, `morechildren` is rate-limited more aggressively than regular thread fetches.

For most brand monitoring and sentiment analysis use cases, the first-page comment response is sufficient. Set `?limit=500&depth=10` on the thread JSON URL to maximize the comments returned in a single request — Reddit caps this, but you will get significantly more than the default. For threads with thousands of comments, plan for multiple `morechildren` fetches and implement exponential backoff on 429 responses.

When archiving at scale, prioritize breadth over depth: collect post metadata and top-level comments across many threads rather than exhaustively expanding every `more` object in a single viral thread. Store `more` IDs in your queue for later expansion if completeness is required. See social media web scraping for archiving pipeline patterns and compliance considerations.
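Queuing `more` IDs for later expansion could be sketched as follows. The walker assumes nested replies live under `data.replies`, and the URL builder uses the `morechildren` query parameters `api_type`, `link_id`, and `children`. The batch size of 100 IDs per request is an assumption to verify against Reddit's current API documentation, and both function names are hypothetical.

```python
from typing import Any, Iterator
from urllib.parse import urlencode

MORECHILDREN = "https://www.reddit.com/api/morechildren.json"


def iter_more_ids(children: list[dict[str, Any]]) -> Iterator[str]:
    """Yield every child ID referenced by a `more` object anywhere in the tree."""
    for child in children:
        data = child.get("data", {})
        if child.get("kind") == "more":
            yield from data.get("children", [])
        elif child.get("kind") == "t1":
            replies = data.get("replies")
            if isinstance(replies, dict):  # assumed: "" when there are no replies
                yield from iter_more_ids(replies["data"]["children"])


def morechildren_urls(link_fullname: str, child_ids: list[str],
                      batch_size: int = 100) -> list[str]:
    """Split queued IDs into batches and build one morechildren URL per batch."""
    urls = []
    for start in range(0, len(child_ids), batch_size):
        batch = child_ids[start:start + batch_size]
        query = urlencode(
            {"api_type": "json", "link_id": link_fullname, "children": ",".join(batch)},
            safe=",",  # keep the comma-separated ID list readable
        )
        urls.append(f"{MORECHILDREN}?{query}")
    return urls
```

Storing the output of `iter_more_ids` alongside the thread ID lets a later job expand only the threads where completeness matters.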

For subreddit history collection, use the listing endpoint pagination cursor (`after=t3_postid`) to walk backwards through time. Reddit listing endpoints expose only about 1,000 posts per sort order. The Pushshift API was the common workaround for deeper history, but it is now restricted, so check current availability and terms before relying on it.

9. Reddit API terms, researcher access, and data handling

Reddit's Developer Terms distinguish between personal use, academic research, and commercial use. The 2023 policy changes removed the free commercial tier from the official API — large-scale commercial data products require a paid Data API agreement. Scraping public HTML does not automatically exempt a product from these terms; Reddit's ToS covers data collection methods beyond just the official API.

Academic researchers have access to separate programs including the Academic Research API (check Reddit's current developer portal for availability and eligibility). If you are building a research tool for a university or non-profit, this is the correct path rather than HTML scraping at scale.

Regardless of access method, handle scraped Reddit data responsibly: do not deanonymize pseudonymous users by correlating Reddit handles with real identities, do not use comment history to harass or profile individuals, and respect deletion — if a user deletes their post or account, remove that content from your dataset in accordance with your data retention policy. GDPR and similar regulations may apply if your users or data subjects are in covered jurisdictions, even though Reddit is a US platform.

Store only the fields your use case requires. Retaining full comment histories for users who have since deleted their accounts creates legal and ethical exposure. Build deletion propagation into your pipeline from the start rather than retrofitting it later.
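Deletion propagation can start as a simple reconciliation pass on each re-fetch. The sketch below assumes both `stored` and `fresh` are dictionaries keyed by comment ID holding `author` and `body`; the function names and the in-memory storage shape are illustrative, not a prescribed design.

```python
from typing import Any

DELETED_AUTHOR = "[deleted]"
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def is_deleted(record: dict[str, Any]) -> bool:
    """True when the author or body carries one of Reddit's placeholder markers."""
    return record.get("author") == DELETED_AUTHOR or record.get("body") in PLACEHOLDER_BODIES


def propagate_deletions(stored: dict[str, dict[str, Any]],
                        fresh: dict[str, dict[str, Any]]) -> list[str]:
    """Scrub stored records that are now deleted upstream; return the scrubbed IDs.

    The record itself (id, parent_id) is kept so the tree structure survives;
    only the author handle and body text are replaced with placeholders.
    """
    scrubbed = []
    for comment_id, latest in fresh.items():
        record = stored.get(comment_id)
        if record is None or is_deleted(record) or not is_deleted(latest):
            continue
        body = latest.get("body")
        record["author"] = DELETED_AUTHOR
        record["body"] = body if body in PLACEHOLDER_BODIES else "[deleted]"
        scrubbed.append(comment_id)
    return scrubbed
```

In a real pipeline the same check would run against your database and any derived artifacts (search indexes, exports, model inputs), not just the primary store.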

Frequently asked questions

Should I use old.reddit.com or the .json endpoint for scraping?

The `.json` endpoint is almost always the better choice: it gives you structured data with all fields including ones not rendered in HTML (upvote_ratio, created_utc as a Unix timestamp, crosspost data, moderation flags), and you avoid HTML parsing entirely. Use `old.reddit.com` HTML scraping as a fallback when the JSON endpoint is rate-limiting you, or when you specifically need the rendered HTML for a particular field. Avoid scraping `www.reddit.com` (new Reddit) — it requires JavaScript execution and the React-rendered output is harder to parse reliably.

What User-Agent header should I send to Reddit?

Reddit's API rules require a descriptive User-Agent that identifies your application and a contact. The documented format is `platform:appname:version (by /u/yourusername)` — for example, `web:my-sentiment-tool:1.2 (by /u/myredditaccount)`. Generic User-Agents like `python-requests/2.28` or browser defaults trigger faster rate limiting. Pass your custom User-Agent via OmniScrape's `custom_headers` field.

Why am I getting 429 errors from Reddit?

Reddit enforces per-IP rate limits on unauthenticated requests. Common causes: too many requests per second from a single IP, datacenter IP ranges that Reddit treats more aggressively, missing or generic User-Agent, and sequential crawling patterns. Mitigations: use residential proxy rotation, add delays between requests (at least 1–2 seconds between calls to the same endpoint), set a proper User-Agent, and spread requests across different subreddits rather than hammering one. If you need high volume legitimately, the official paid API is the correct path. See web scraping without getting blocked for general IP rotation strategies.

How do I get comments beyond the first page of a large thread?

Reddit's thread JSON truncates large comment trees and replaces collapsed subtrees with `more` objects containing child IDs. To expand them, call `https://www.reddit.com/api/morechildren.json?api_type=json&link_id=t3_postid&children=id1,id2,...`. This endpoint is more aggressively rate-limited without OAuth. For most analytics use cases, fetching with `?limit=500&depth=10` on the thread URL and processing only the returned comments is sufficient — full tree expansion is only necessary for complete archival.

Can I scrape NSFW or quarantined subreddits?

No, not without authentication. NSFW subreddits require a logged-in Reddit account with age verification enabled. Quarantined subreddits require explicit opt-in. Anonymous requests to these subreddits return a redirect to a login or warning page rather than content. Scraping them without authorization violates Reddit's terms, and the login requirement means you would need to manage session cookies — a significantly more complex setup with additional legal exposure.

How do I paginate through a subreddit's post history?

Use the listing endpoint with `after` cursor pagination. Fetch `/r/subreddit/new.json?limit=100`, extract `data.after` from the response (a value like `t3_postid`), then fetch `/r/subreddit/new.json?limit=100&after=t3_postid` for the next page. Continue until `data.after` is null. Reddit listing endpoints expose approximately 1,000 posts per sort order — you cannot paginate further back than that through the standard API. For deeper historical data, check the current availability of the Pushshift Reddit dataset or Reddit's official Data API.

Is scraping Reddit legal for commercial use?

This is a legal question you should discuss with counsel, but the practical landscape is: Reddit's Developer Terms explicitly cover commercial data collection and require a paid agreement for commercial API use at scale. The Ninth Circuit's 2022 ruling in hiQ v. LinkedIn and related cases have complicated the CFAA analysis for public web scraping, but Reddit's ToS restriction on commercial use is a contractual issue separate from CFAA. Academic and personal use occupy a different position. Do not rely on "it's public data" as a blanket justification for commercial products. Read Reddit's current Developer Terms and consult legal counsel if you are building a commercial data product.

Related guides

  • Social Media Web Scraping: Brand Mention Monitoring from Public Pages
  • Sentiment Analysis Web Scraping: Build a Production Review Pipeline
  • Web Scraping with Python
  • Web Scraping Without Getting Blocked

© 2026 OmniScrape. All rights reserved.
