1. Reddit data fields social listening teams extract
The fields worth collecting depend on your use case. Brand monitors care about mention volume, sentiment signals in post body and comments, and the score as a proxy for community endorsement. Researchers archiving threads for longitudinal studies need stable identifiers, timestamps, and the full comment tree with parent references so they can reconstruct conversation structure offline.
Below is a practical field inventory covering posts, comments, and subreddit-level metadata. Not all fields are available on every surface — the `.json` endpoint exposes the most complete set, while HTML scraping gives you the visible subset rendered on the page.
- Post id, title, selftext (body) or external link URL, author handle
- Score (net upvotes), upvote_ratio, num_comments
- created_utc timestamp and post flair text/color
- Subreddit name, subscriber count (subscribers), description, rules summary
- Comments: body, author, score, depth level, parent_id, created_utc
- Awards and gilding counts when visible in JSON (gilded, all_awardings)
- over_18 (NSFW) flag and quarantine flag at subreddit and post level
- Crosspost parent list (crosspost_parent_list) for tracking reposts
- Distinguished field (moderator/admin labels) and stickied flag
- Removal reason or removed_by_category for moderation research
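One way to pin the inventory above down is a pair of record types. The sketch below is a minimal Python schema, not a complete one: the class names `PostRecord` and `CommentRecord` and the helper `post_from_json` are hypothetical, and the field names simply mirror the JSON keys listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostRecord:
    """Post-level fields; attribute names mirror the keys in Reddit's JSON."""
    id: str
    title: str
    author: str
    subreddit: str
    created_utc: float              # Unix timestamp in seconds
    score: int
    upvote_ratio: Optional[float] = None
    num_comments: int = 0
    selftext: str = ""
    url: Optional[str] = None       # external link for link posts
    over_18: bool = False
    stickied: bool = False


@dataclass
class CommentRecord:
    """Comment-level fields; parent_id lets you rebuild the tree offline."""
    id: str
    parent_id: str                  # "t3_..." for a post, "t1_..." for a comment
    author: str
    body: str
    score: int
    depth: int
    created_utc: float


def post_from_json(data: dict) -> PostRecord:
    """Build a PostRecord from one post's `data` dict, tolerating missing keys."""
    return PostRecord(
        id=data["id"],
        title=data.get("title", ""),
        author=data.get("author", "[deleted]"),
        subreddit=data.get("subreddit", ""),
        created_utc=float(data.get("created_utc", 0.0)),
        score=int(data.get("score", 0)),
        upvote_ratio=data.get("upvote_ratio"),
        num_comments=int(data.get("num_comments", 0)),
        selftext=data.get("selftext", ""),
        url=data.get("url"),
        over_18=bool(data.get("over_18", False)),
        stickied=bool(data.get("stickied", False)),
    )
```

Keeping the record types narrow makes it easier to store only the fields your use case requires; add moderation or crosspost fields only if you actually analyze them.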
2. Reddit URL patterns for scraping
Reddit's URL structure is consistent and predictable, which makes it straightforward to construct target URLs programmatically. The `.json` suffix works on most public listing and thread URLs — append it directly before any query string parameters. `old.reddit.com` mirrors the same path structure as `www.reddit.com` but serves static HTML instead of the React SPA.
When building a crawler, generate thread URLs from subreddit listing pages (e.g., `/r/python/new.json?limit=100`) and then fetch individual thread JSON for full comment data. Subreddit listing endpoints support `after` and `before` pagination cursors, passed as query parameters (for example, `after=t3_postid`).
- Post page: https://www.reddit.com/r/python/comments/abc123/title_slug/
- Post JSON: https://www.reddit.com/r/python/comments/abc123/title_slug.json
- Subreddit hot listing: https://www.reddit.com/r/MachineLearning/hot.json?limit=100
- Subreddit new listing: https://www.reddit.com/r/MachineLearning/new.json?limit=100
- Subreddit top (all time): https://www.reddit.com/r/MachineLearning/top.json?t=all&limit=100
- Pagination cursor: append &after=t3_postid to any listing URL
- old.reddit thread: https://old.reddit.com/r/python/comments/abc123/
- old.reddit subreddit: https://old.reddit.com/r/datascience/new/
- User profile (public): https://www.reddit.com/user/username/submitted.json
- Search results: https://www.reddit.com/search.json?q=keyword&sort=new&limit=100
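The patterns above can be generated with a couple of small helpers. This is a sketch only: `listing_url` and `thread_json_url` are hypothetical names, and the query parameters are the ones shown in the list (`limit`, `t`, `after`).

```python
from typing import Optional
from urllib.parse import urlencode

BASE = "https://www.reddit.com"


def listing_url(subreddit: str, sort: str = "new", limit: int = 100,
                after: Optional[str] = None, t: Optional[str] = None) -> str:
    """Build a subreddit listing URL such as /r/python/new.json?limit=100."""
    params = {"limit": limit}
    if t:
        params["t"] = t            # time window for `top`, e.g. "all"
    if after:
        params["after"] = after    # pagination cursor, e.g. "t3_postid"
    return f"{BASE}/r/{subreddit}/{sort}.json?{urlencode(params)}"


def thread_json_url(subreddit: str, post_id: str, slug: str = "") -> str:
    """Build a thread JSON URL; the title slug is optional."""
    path = f"/r/{subreddit}/comments/{post_id}"
    if slug:
        path += f"/{slug}"
    return f"{BASE}{path}.json"
```

For example, `listing_url("MachineLearning", sort="top", t="all")` yields the all-time top listing, and passing `after="t3_postid"` appends the cursor for the next page.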
3. The .json suffix: structured data without HTML parsing
Appending `.json` to a Reddit thread URL returns a two-element array: `[0]` is the post listing (one item), and `[1]` is the comment listing with a nested `children` array. Each comment is a `t1` kind object; the special `more` kind signals additional comments not yet loaded. This structure lets you reconstruct the full comment tree using `parent_id` references without any HTML parsing.
The response shape for a post is `data.children[0].data` — fields like `title`, `selftext`, `score`, `upvote_ratio`, `num_comments`, `created_utc`, and `author` are all present. Comments are at `[1].data.children`, each with `data.body`, `data.author`, `data.score`, `data.depth`, and `data.parent_id`.
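The two-element structure described above can be walked with a short iterative traversal. The sketch below is one possible approach, with `flatten_thread` as a hypothetical name. It assumes nested replies live under each comment's `replies` key, which is an empty string when a comment has no replies; verify that against a live response before relying on it.

```python
import json

COMMENT_FIELDS = ("id", "parent_id", "author", "body", "score", "depth")


def flatten_thread(raw: str):
    """Return (post, comments, more_ids) from a thread's .json response text."""
    post_listing, comment_listing = json.loads(raw)
    post = post_listing["data"]["children"][0]["data"]

    comments, more_ids = [], []
    # Reverse before pushing so the stack pops children in their original order.
    stack = list(reversed(comment_listing["data"]["children"]))
    while stack:
        node = stack.pop()
        data = node["data"]
        if node["kind"] == "more":
            # Placeholder for comments that were not loaded; keep the IDs for later.
            more_ids.extend(data.get("children", []))
            continue
        comments.append({key: data.get(key) for key in COMMENT_FIELDS})
        replies = data.get("replies")
        if isinstance(replies, dict):  # assumed to be "" when there are no replies
            stack.extend(reversed(replies["data"]["children"]))
    return post, comments, more_ids
```

The result is a flat, depth-first list of comments; because every record keeps `parent_id` and `depth`, the tree can be rebuilt later without re-fetching.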
Rate limits are enforced per IP. Reddit expects a descriptive `User-Agent` header identifying your application; the format from their API wiki is `platform:appname:version (by /u/yourusername)`. Without a proper User-Agent, requests are more aggressively rate-limited. Deleted comments show `[deleted]` as the author, and the body reads `[deleted]` when the user deleted it or `[removed]` when a moderator removed it. Preserve these markers in your schema rather than dropping the records, since the comment's position in the tree still carries structural meaning.
For subreddit listing endpoints, the response is a single listing object: `data.children` is an array of post objects, and `data.after` is the pagination cursor for the next page. Set `limit=100` (Reddit's maximum per request) to minimize the number of round trips needed to cover a subreddit's history.
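Following the `data.after` cursor can be sketched as a generator. This is an illustration rather than a finished crawler: `iter_listing` is a hypothetical name, and `fetch` is a caller-supplied function (also hypothetical) that takes the current cursor, or `None` for the first page, and returns the listing's JSON text. The default of 10 pages at `limit=100` is an assumption chosen to match the roughly 1,000-post ceiling on listings.

```python
import json
from typing import Callable, Iterator, Optional


def iter_listing(fetch: Callable[[Optional[str]], str],
                 max_pages: int = 10) -> Iterator[dict]:
    """Yield post dicts page by page, following `data.after` until it is null."""
    after = None
    for _ in range(max_pages):
        listing = json.loads(fetch(after))["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:
            break  # no further pages
```

Injecting `fetch` keeps the pagination logic separate from transport concerns such as proxies, headers, and retry handling.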
4. old.reddit HTML structure and CSS selectors
`old.reddit.com` uses a stable, server-rendered HTML structure that has changed minimally over many years. Each post on a listing page is a `div.thing` element with data attributes (`data-fullname`, `data-score`, `data-comments-count`, `data-subreddit`) that are often more reliable than parsing visible text. On a thread page, the post itself is `div.thing.link` and comments are `div.thing.comment` nested inside `div.child` containers.
Key selectors for post listings: `a.title` for the post title and link, `time.live-timestamp` (with `datetime` attribute) for the ISO timestamp, `div.score.unvoted` for the score, `a.author` for the username, `a.subreddit` for the subreddit name, and `a.comments` for the comment count and thread link.
For thread pages, the post body is inside `div.usertext-body > div.md`. Comment bodies are `div.thing.comment div.usertext-body > div.md` — but because comments nest arbitrarily deep, a flat CSS selector will return all comments at all depths. If you need depth information, parse the nesting level from the `div.child` hierarchy or read the `data-depth` attribute on `div.thing.comment`.
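If you fetch the old.reddit HTML yourself, the `div.thing` data attributes can be read with the standard library alone. The sketch below uses `html.parser` and hypothetical names (`ThingParser`, `parse_things`); it collects only the attributes mentioned above and does not attempt to extract body text.

```python
from html.parser import HTMLParser


class ThingParser(HTMLParser):
    """Collect data-* attributes from every div whose class list includes `thing`."""

    def __init__(self):
        super().__init__()
        self.things = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "div" and "thing" in classes:
            self.things.append({
                "fullname": attr_map.get("data-fullname"),
                "score": attr_map.get("data-score"),
                "comments_count": attr_map.get("data-comments-count"),
                "subreddit": attr_map.get("data-subreddit"),
                "is_comment": "comment" in classes,
            })


def parse_things(html: str) -> list:
    """Return one dict per div.thing element found in the HTML."""
    parser = ThingParser()
    parser.feed(html)
    return parser.things
```

Attribute values come back as strings (or `None` when absent), so convert `score` and `comments_count` to integers before aggregating.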
new.reddit renders comment threads client-side via React. Fetching `www.reddit.com` thread pages without JavaScript execution returns a nearly empty shell. Always prefer `old.reddit.com` or the `.json` endpoint for scraping — only fall back to `js_rendering` mode if you specifically need new Reddit's UI or features not available elsewhere.
5. Rate limits, access controls, and bot detection
Reddit does not sit behind Cloudflare for most public pages, but it runs its own rate limiting and bot detection. The primary signals Reddit uses are request rate per IP, missing or generic User-Agent strings, and behavioral patterns like fetching sequential post IDs without any variation. Datacenter IPs are more aggressively limited than residential addresses.
The 2023 API changes introduced paid tiers for the official API. The public `.json` endpoints remain accessible without OAuth for read-only access to public subreddits, but Reddit has tightened enforcement. NSFW and quarantined subreddits require a logged-in session with age verification — you cannot scrape these with anonymous requests. Private subreddits require membership and cannot be accessed without authentication.
For high-volume collection, residential proxy rotation is the most effective mitigation for IP-based rate limits. Spread requests across subreddits and time rather than hammering a single endpoint. If you are building a production pipeline, consider the official Data API for large-scale commercial use — it is the only path that is clearly within Reddit's terms for commercial products.
- 429 Too Many Requests from rapid unauthenticated requests — back off and rotate IPs
- OAuth required for official API; free tier removed for commercial use in 2023
- NSFW and quarantined subreddits require authenticated session with age confirmation
- new.reddit thread pages require JavaScript rendering for full comment load
- Shadowbanned users' posts appear deleted in HTML; their JSON entries show removed status
- Missing or bot-like User-Agent strings trigger faster rate limiting
- Sequential ID crawling patterns are more likely to trigger blocks than organic browsing patterns
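Backing off on 429 responses can be sketched as follows. The function name `fetch_with_backoff`, the retry count, and the 2-second base delay are assumptions for illustration; `fetch` is a caller-supplied callable returning `(status_code, body)`. The sketch handles only 429 and passes every other response back to the caller unchanged.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 5,
                       base_delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `fetch` until it returns a non-429 status, doubling the wait each retry."""
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return body
        if attempt == max_retries:
            break
        # Exponential delay plus jitter so parallel workers do not retry in lockstep.
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("still rate-limited after retries")
```

The `sleep` parameter exists so the delay can be replaced in tests; in production, leave it at the default and rotate the proxy inside `fetch` if your setup supports it.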
6. Fetch a Reddit thread as JSON with OmniScrape
Use OmniScrape to fetch the `.json` URL for a Reddit thread. The response arrives in `body.data.content` as a JSON string — parse it in your worker to access the post and comment arrays. `mode: "auto"` is sufficient for most Reddit JSON endpoints since they are server-rendered responses; OmniScrape will use the fast HTTP lane and only escalate to a browser if needed.
Use a residential US proxy to reduce the chance of hitting IP-based rate limits. If you need to pass a custom `User-Agent` header to satisfy Reddit's API rules, add it via the `custom_headers` field.
{
  "url": "https://www.reddit.com/r/webscraping/comments/1a2b3c4/how_do_you_handle_rate_limits.json",
  "mode": "auto",
  "output_format": "html",
  "proxy": "residential:us",
  "custom_headers": {
    "User-Agent": "web:omniscrape-reddit-example:1.0 (by /u/your_reddit_username)"
  }
}
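On the worker side, unwrapping the result might look like the sketch below. It assumes the response body has already been decoded into a Python dict, so that the JSON string sits at `body["data"]["content"]` as described above; `thread_from_response` is a hypothetical helper, not part of OmniScrape.

```python
import json


def thread_from_response(body: dict):
    """Extract the post and the top-level comments from a decoded response body."""
    # Assumption: `content` holds Reddit's raw JSON text for the thread.
    content = body["data"]["content"]
    post_listing, comment_listing = json.loads(content)
    post = post_listing["data"]["children"][0]["data"]
    # Keep only real comments (kind "t1"); "more" placeholders are skipped here.
    top_level = [child["data"]
                 for child in comment_listing["data"]["children"]
                 if child["kind"] == "t1"]
    return post, top_level
```

If the content arrives wrapped in anything other than plain JSON text, `json.loads` will raise, which is a useful early signal that the request options need adjusting.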
7. Scrape old.reddit HTML with CSS extraction
When you need specific visible fields from a thread without parsing the full JSON tree, CSS extraction against `old.reddit.com` is efficient. OmniScrape runs the selectors server-side and returns only the matched values in `body.data.css_extracted` — no need to download and parse the full HTML in your worker.
The selectors below target the post-level fields on a thread page. For comment bodies, add a multi-match selector for `div.thing.comment div.usertext-body > div.md` — the response will include an array of all matched elements. Note that `div.score.unvoted` may render as `div.score.likes` or `div.score.dislikes` depending on vote state; `div.score` alone is a safer selector if you do not need vote direction.
{
  "url": "https://old.reddit.com/r/datascience/comments/xyz789/weekly_thread/",
  "mode": "auto",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "custom_headers": {
    "User-Agent": "web:omniscrape-reddit-example:1.0 (by /u/your_reddit_username)"
  },
  "css_selectors": {
    "title": "a.title",
    "score": "div.score",
    "author": "a.author",
    "subreddit": "a.subreddit",
    "comment_count": "a.comments",
    "post_body": "div.usertext-body > div.md",
    "timestamp": "time.live-timestamp",
    "flair": "span.flair"
  }
}
9. Reddit API terms, researcher access, and data handling
Reddit's Developer Terms distinguish between personal use, academic research, and commercial use. The 2023 policy changes removed the free commercial tier from the official API — large-scale commercial data products require a paid Data API agreement. Scraping public HTML does not automatically exempt a product from these terms; Reddit's ToS covers data collection methods beyond just the official API.
Academic researchers have access to separate programs including the Academic Research API (check Reddit's current developer portal for availability and eligibility). If you are building a research tool for a university or non-profit, this is the correct path rather than HTML scraping at scale.
Regardless of access method, handle scraped Reddit data responsibly: do not deanonymize pseudonymous users by correlating Reddit handles with real identities, do not use comment history to harass or profile individuals, and respect deletion — if a user deletes their post or account, remove that content from your dataset in accordance with your data retention policy. GDPR and similar regulations may apply if your users or data subjects are in covered jurisdictions, even though Reddit is a US platform.
Store only the fields your use case requires. Retaining full comment histories for users who have since deleted their accounts creates legal and ethical exposure. Build deletion propagation into your pipeline from the start rather than retrofitting it later.
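Deletion propagation can start as a periodic re-fetch that scrubs stored records. The sketch below is one possible policy, not a compliance guarantee: `propagate_deletions` and `TOMBSTONES` are hypothetical names, both arguments map comment ID to record, and the choice to blank the body whenever the author is gone is an assumption you should align with your own retention policy.

```python
from typing import Dict

TOMBSTONES = {"[deleted]", "[removed]"}


def propagate_deletions(stored: Dict[str, dict], fresh: Dict[str, dict]) -> int:
    """Scrub stored comments that a re-fetch shows as deleted or removed.

    Only `author` and `body` are overwritten; `id` and `parent_id` stay so the
    tree structure survives. Returns the number of records changed.
    """
    scrubbed = 0
    for comment_id, record in stored.items():
        latest = fresh.get(comment_id)
        if latest is None:
            continue  # not re-fetched in this pass; check again on a later run
        author, body = latest.get("author"), latest.get("body")
        if author in TOMBSTONES or body in TOMBSTONES:
            # Policy assumption: drop the text even if only the account is gone.
            new_body = body if body in TOMBSTONES else "[deleted]"
            if (record.get("author"), record.get("body")) != (author, new_body):
                record["author"], record["body"] = author, new_body
                scrubbed += 1
    return scrubbed
```

Running this on a schedule keeps the dataset converging toward what Reddit currently shows, and the return value gives you a simple metric to log for audits.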
Frequently asked questions
Should I use old.reddit.com or the .json endpoint for scraping?
The `.json` endpoint is almost always the better choice: it gives you structured data with all fields including ones not rendered in HTML (upvote_ratio, created_utc as a Unix timestamp, crosspost data, moderation flags), and you avoid HTML parsing entirely. Use `old.reddit.com` HTML scraping as a fallback when the JSON endpoint is rate-limiting you, or when you specifically need the rendered HTML for a particular field. Avoid scraping `www.reddit.com` (new Reddit) — it requires JavaScript execution and the React-rendered output is harder to parse reliably.
What User-Agent header should I send to Reddit?
Reddit's API rules require a descriptive User-Agent that identifies your application and a contact. The documented format is `platform:appname:version (by /u/yourusername)` — for example, `web:my-sentiment-tool:1.2 (by /u/myredditaccount)`. Generic User-Agents like `python-requests/2.28` or browser defaults trigger faster rate limiting. Pass your custom User-Agent via OmniScrape's `custom_headers` field.
Why am I getting 429 errors from Reddit?
Reddit enforces per-IP rate limits on unauthenticated requests. Common causes: too many requests per second from a single IP, datacenter IP ranges that Reddit treats more aggressively, missing or generic User-Agent, and sequential crawling patterns. Mitigations: use residential proxy rotation, add delays between requests (at least 1–2 seconds between calls to the same endpoint), set a proper User-Agent, and spread requests across different subreddits rather than hammering one. If you need high volume legitimately, the official paid API is the correct path. See web scraping without getting blocked for general IP rotation strategies.
How do I get comments beyond the first page of a large thread?
Reddit's thread JSON truncates large comment trees and replaces collapsed subtrees with `more` objects containing child IDs. To expand them, call `https://www.reddit.com/api/morechildren.json?api_type=json&link_id=t3_postid&children=id1,id2,...`. This endpoint is more aggressively rate-limited without OAuth. For most analytics use cases, fetching with `?limit=500&depth=10` on the thread URL and processing only the returned comments is sufficient — full tree expansion is only necessary for complete archival.
Can I scrape NSFW or quarantined subreddits?
No, not without authentication. NSFW subreddits require a logged-in Reddit account with age verification enabled. Quarantined subreddits require explicit opt-in. Anonymous requests to these subreddits return a redirect to a login or warning page rather than content. Scraping them without authorization violates Reddit's terms, and the login requirement means you would need to manage session cookies — a significantly more complex setup with additional legal exposure.
How do I paginate through a subreddit's post history?
Use the listing endpoint with `after` cursor pagination. Fetch `/r/subreddit/new.json?limit=100`, extract `data.after` from the response (a value like `t3_postid`), then fetch `/r/subreddit/new.json?limit=100&after=t3_postid` for the next page. Continue until `data.after` is null. Reddit listing endpoints expose approximately 1,000 posts per sort order — you cannot paginate further back than that through the standard API. For deeper historical data, check the current availability of the Pushshift Reddit dataset or Reddit's official Data API.
Is scraping Reddit legal for commercial use?
This is a legal question you should discuss with counsel, but the practical landscape is: Reddit's Developer Terms explicitly cover commercial data collection and require a paid agreement for commercial API use at scale. The Ninth Circuit's 2022 hiQ v. LinkedIn ruling and related cases have complicated the CFAA analysis for public web scraping, but Reddit's ToS restriction on commercial use is a contractual issue separate from CFAA. Academic and personal use occupy a different position. Do not rely on 'it's public data' as a blanket justification for commercial products. Read Reddit's current Developer Terms and consult legal counsel if you are building a commercial data product.
8. Comment pagination and large thread handling
Reddit truncates comment trees in the `.json` response for large threads. The default response includes up to 200 top-level comments and collapses deep subtrees into `more` objects — these contain a list of child IDs that must be fetched separately via the `morechildren` API endpoint (`https://www.reddit.com/api/morechildren.json`). Without OAuth, `morechildren` is rate-limited more aggressively than regular thread fetches.
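Turning collected `more` IDs into `morechildren` requests can be sketched as below. `morechildren_urls` is a hypothetical helper; the query parameters (`api_type`, `link_id`, `children`) are the ones used elsewhere in this guide, and the batch size of 100 IDs per request is an assumption to check against Reddit's current API documentation.

```python
from urllib.parse import urlencode

MORECHILDREN = "https://www.reddit.com/api/morechildren.json"


def morechildren_urls(post_id: str, child_ids: list, batch_size: int = 100) -> list:
    """Build one morechildren URL per batch of child IDs."""
    urls = []
    for start in range(0, len(child_ids), batch_size):
        batch = child_ids[start:start + batch_size]
        query = urlencode(
            {
                "api_type": "json",
                "link_id": f"t3_{post_id}",
                "children": ",".join(batch),
            },
            safe=",",  # keep the comma-separated ID list readable in the URL
        )
        urls.append(f"{MORECHILDREN}?{query}")
    return urls
```

Queue the resulting URLs rather than fetching them inline, so that rate-limit backoff on this endpoint does not stall the rest of the crawl.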
For most brand monitoring and sentiment analysis use cases, the first-page comment response is sufficient. Set `?limit=500&depth=10` on the thread JSON URL to maximize the comments returned in a single request — Reddit caps this, but you will get significantly more than the default. For threads with thousands of comments, plan for multiple `morechildren` fetches and implement exponential backoff on 429 responses.
When archiving at scale, prioritize breadth over depth: collect post metadata and top-level comments across many threads rather than exhaustively expanding every `more` object in a single viral thread. Store `more` IDs in your queue for later expansion if completeness is required. See social media web scraping for archiving pipeline patterns and compliance considerations.
For subreddit history collection, use the listing endpoint pagination cursor (`after=t3_postid`) to walk backwards through time. Reddit listing endpoints only expose approximately 1,000 posts per sort (the Pushshift API, now restricted, was the common workaround for deeper history — check current availability and terms before relying on it).