X (Twitter) Scraper: Tweets, Profiles, and Hashtags

1. Data fields social monitoring products try to capture

The exact fields your pipeline needs determine whether HTML scraping is viable at all. Crisis communications teams care about mention velocity and reach, so they need follower counts and repost rates in near real time. Academic researchers studying public discourse want full tweet text, timestamps, and language tags for corpus analysis. Influencer marketing platforms need engagement rate calculations across sponsored posts, which requires like, reply, repost, and view counts together.

Below is the full set of fields that production monitoring systems typically require. The parenthetical notes flag fields that are login-gated, absent from logged-out HTML, or only available via the official API.

Tweet ID (numeric status ID from URL — always available on permalink pages)
Tweet text and language code (available logged-out on individual status URLs, intermittently)
created_at timestamp (rendered as a <time> element with datetime attribute)
Like, repost, reply, and view counts (data-testid buttons; view counts often absent logged-out)
Author handle, display name, and avatar URL
Follower and following counts (profile header, sometimes rendered logged-out)
Hashtags, cashtags, and outbound link URLs (parsed from tweet text or anchor tags)
Quote tweet and reply parent references (parent tweet URL in thread context)
Media URLs — image src and video poster thumbnail (present in DOM on some logged-out views)
Verified badge status and account type (blue, gold, grey checkmark)
Profile bio, location string, and join date (profile header)
Pinned tweet reference
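
Because most of these fields are only intermittently available, it can help to model a scraped tweet as a record whose fields are all optional except the ID. The following is a minimal sketch, not a schema from any X or OmniScrape documentation; the class name, field names, and the missing_fields helper are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TweetRecord:
    """One scraped tweet; every field except the ID may be missing."""

    tweet_id: int                      # taken from the permalink URL
    text: Optional[str] = None
    lang: Optional[str] = None
    created_at: Optional[str] = None   # ISO 8601 string from <time datetime>
    like_count: Optional[int] = None
    repost_count: Optional[int] = None
    reply_count: Optional[int] = None
    view_count: Optional[int] = None   # often absent on logged-out views
    author_handle: Optional[str] = None
    hashtags: list[str] = field(default_factory=list)
    media_urls: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of scalar fields that the scrape did not fill in."""
        return [name for name, value in vars(self).items() if value is None]
```

Tracking which fields came back empty per record makes it easier to see when a field that used to render logged-out has stopped doing so.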

2. X.com URL patterns worth knowing

Tweet permalink URLs are stable as long as the numeric status ID exists. The handle in the URL is cosmetic: X redirects any handle to the correct one if the ID is valid, which matters when accounts change usernames. Search, hashtag, and list URLs are almost universally login-gated as of 2024, making them poor targets for HTML scraping at scale.

Note that twitter.com redirects to x.com, so both hosts resolve to the same content. Normalise all stored URLs to a single handle-free form keyed on the status ID (for example x.com/i/web/status/{id}) to avoid duplicates in your database.

Tweet permalink: https://x.com/{handle}/status/{tweet_id} — most reliable logged-out target
Profile root: https://x.com/{handle} — bio and follower counts sometimes render logged-out
Profile media tab: https://x.com/{handle}/media — usually login-gated
Search: https://x.com/search?q={query}&src=typed_query — login wall in most regions
Hashtag: https://x.com/hashtag/{tag} — login-gated since mid-2023
List: https://x.com/i/lists/{list_id} — requires login
Moments / Events: https://x.com/i/events/{id} — inconsistent logged-out rendering
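
A scheduler can use these patterns to decide up front whether a URL is worth sending to a logged-out scraper. The sketch below classifies a URL by route using only the patterns listed above. The function name and the gating flags are assumptions drawn from the notes in that list (the event route is treated as gated here because its rendering is described as inconsistent), not behaviour guaranteed by X.

```python
import re
from urllib.parse import urlparse

# (route name, path pattern, usually login-gated?) in match-priority order.
_ROUTES = [
    ("permalink", re.compile(r"^/[^/]+/status/\d+/?$"), False),
    ("search", re.compile(r"^/search/?$"), True),
    ("hashtag", re.compile(r"^/hashtag/[^/]+/?$"), True),
    ("list", re.compile(r"^/i/lists/\d+/?$"), True),
    ("event", re.compile(r"^/i/events/\d+/?$"), True),
    ("profile_media", re.compile(r"^/[^/]+/media/?$"), True),
    ("profile", re.compile(r"^/[^/]+/?$"), False),
]


def classify_x_url(url: str) -> tuple[str, bool]:
    """Return (route_name, usually_login_gated) for an x.com or twitter.com URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in {"x.com", "twitter.com"}:
        return ("unknown", True)
    for name, pattern, gated in _ROUTES:
        if pattern.match(parsed.path):
            return (name, gated)
    return ("unknown", True)
```

Unknown routes default to gated so that new or unrecognised URL shapes are skipped rather than retried blindly.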

3. X HTML structure and embedded JSON state

X's web client is a client-rendered React application. The server returns an initial HTML payload that sometimes includes tweet content for logged-out users on individual status URLs, but the client-side React layer controls what actually renders in the browser. A plain HTTP fetch (no JavaScript execution) will therefore often return an HTML shell with no tweet content, and you need a headless browser to get the fully rendered DOM.

When tweet content does render, the key data-testid attributes are: article[data-testid="tweet"] as the root tweet container, div[data-testid="tweetText"] for the tweet body, time[datetime] for the ISO 8601 timestamp, and the engagement buttons [data-testid="reply"], [data-testid="retweet"], and [data-testid="like"]. View counts appear in an anchor linking to /analytics. These testid values have been relatively stable but are not guaranteed, since X engineers do rename them.

X may also embed a JSON state blob in an inline <script> tag. Next.js sites expose such state as <script id="__NEXT_DATA__">, but X is not known to be built on Next.js; its state has instead been observed as a window.__INITIAL_STATE__ assignment, so confirm the marker against the HTML you actually receive. When the blob carries tweet data, it can include full text, entities (urls, hashtags, media), and author objects. Its schema changes without notice and is not documented. Parsing it is fragile but can yield cleaner structured data than CSS extraction when it is present. Always fall back to CSS selectors when the blob is absent or restructured.
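
The "try the blob, fall back to CSS" pattern can be sketched as a helper that returns None whenever the blob is missing or no longer parses. This is an illustrative sketch only: the function name is hypothetical, and the marker string (a script id or a variable assignment) is an assumption you would need to confirm against real page HTML.

```python
import json
from typing import Any, Optional


def extract_embedded_json(html: str, marker: str) -> Optional[Any]:
    """Decode the first JSON object that appears after `marker`, or None."""
    start = html.find(marker)
    if start == -1:
        return None  # blob absent: caller should fall back to CSS selectors
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so trailing JavaScript or HTML after the object is ignored.
        value, _end = json.JSONDecoder().raw_decode(html, brace)
    except json.JSONDecodeError:
        return None  # blob truncated or restructured
    return value
```

A None result is the signal to use the CSS-extracted values instead; the caller should also treat any missing key inside a successfully decoded blob the same way, since the schema is undocumented.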

Tweet ID is always recoverable from the URL path — /status/NUMERIC — regardless of DOM structure. Treat the URL as the source of truth for the ID.
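
Recovering the ID from the path takes one regular expression. The helper below is a small sketch (the function name is hypothetical) that works for both x.com and twitter.com permalinks, with or without query strings or trailing path segments.

```python
import re
from typing import Optional

_STATUS_RE = re.compile(r"/status/(\d+)")


def tweet_id_from_url(url: str) -> Optional[int]:
    """Return the numeric status ID from a tweet permalink, or None."""
    match = _STATUS_RE.search(url)
    return int(match.group(1)) if match else None
```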

4. Login walls, rate limits, and anti-bot measures

X's access restrictions are intentional product decisions, not incidental bot defences. Since 2023, most timeline, search, hashtag, and list views redirect logged-out visitors to a login modal before any tweet content renders. Individual status permalinks are the last remaining surface that sometimes serves content to logged-out users, and even these are inconsistent — X has been progressively tightening this path.

At the infrastructure level, datacenter IP ranges are blocked aggressively. Residential proxies improve success rates on individual permalink pages, but are not a reliable solution for search or hashtag pages where the login wall is enforced server-side. Rate limits apply per IP and per session. Frequent React component restructuring means CSS selectors that work today may break within weeks of a frontend deploy.

Budget realistically: if you are building a production system on X HTML scraping, allocate engineering time for weekly selector audits and expect periods of zero data when X ships breaking changes. For anything requiring search, timelines, or hashtag tracking at scale, the official API is the only sustainable path.

Login modal on search, hashtag pages, lists, and most timeline views
Server-side login enforcement — residential proxies do not bypass it on gated routes
Aggressive blocking of datacenter IP ranges on all routes
Rate limiting per IP and per browser session fingerprint
Frequent React component restructuring that breaks CSS selectors
Age-restricted and geo-withheld content varies by proxy exit region
X Terms of Service restrict automated collection — legal review required before production use
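
Given per-IP and per-session rate limits, a retry policy should spread attempts out rather than hammer the same URL. One common approach is exponential backoff with full jitter, sketched below. The base and cap values are hypothetical starting points, not limits published by X.

```python
import random


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 3600.0,
                  rng: random.Random = random.Random()) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    # The ceiling doubles with each attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so retries do not synchronise.
    return rng.uniform(0, ceiling)
```

Randomising the whole interval, rather than adding a small jitter to a fixed delay, avoids many workers retrying the same gated route at the same moment.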

5. Scrape a public tweet permalink

Individual status URLs are the most viable HTML scraping target on X. Use mode js_rendering because the tweet content is rendered by React in the browser — a plain HTTP fetch returns an empty shell. Set js_wait_selector to [data-testid="tweetText"] so the request waits until the tweet body is present in the DOM before extracting. If the selector never appears, X has served a login modal instead of content.

Use a residential US proxy. X's logged-out rendering is inconsistent across regions and more reliable on US exit nodes. Set js_wait_timeout to at least 12 seconds — X's JS bundle is large and hydration is slow.

The response data arrives in body.data.css_extracted when output_format is css_extractor. Check body.success before processing. A successful extraction with empty text fields usually means a login modal rendered — treat it as a soft failure and do not retry the same URL immediately.

X tweet permalink — js_rendering with CSS extraction

json

{
  "url": "https://x.com/OpenAI/status/1234567890123456789",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-testid=\"tweetText\"]",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "text": "[data-testid=\"tweetText\"]",
    "author_name": "[data-testid=\"User-Name\"] span:first-child",
    "author_handle": "[data-testid=\"User-Name\"] a",
    "timestamp": "time[datetime]",
    "replies": "[data-testid=\"reply\"] span[data-testid=\"app-text-transition-container\"]",
    "retweets": "[data-testid=\"retweet\"] span[data-testid=\"app-text-transition-container\"]",
    "likes": "[data-testid=\"like\"] span[data-testid=\"app-text-transition-container\"]",
    "views": "a[href$=\"/analytics\"] span",
    "media_img": "[data-testid=\"tweetPhoto\"] img"
  }
}
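
Once a response arrives, the soft-failure check described above can be sketched as follows. The body.success and body.data.css_extracted paths come from this guide; the assumption that each extracted value is a plain string, the function names, and the abbreviated count formats ("12.3K", "1.5M") are illustrative and should be checked against real responses.

```python
import re
from typing import Any, Optional

_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
_COUNT_RE = re.compile(r"^\s*([\d,]*\.?\d+)\s*([KMB]?)\s*$", re.IGNORECASE)


def classify_response(body: dict[str, Any]) -> str:
    """Return 'ok', 'login_wall', or 'error' for one scrape response."""
    if not body.get("success"):
        return "error"
    extracted = (body.get("data") or {}).get("css_extracted") or {}
    text = extracted.get("text")
    # Success with an empty tweet body usually means a login modal rendered.
    if not text or not str(text).strip():
        return "login_wall"
    return "ok"


def parse_count(raw: Optional[str]) -> Optional[int]:
    """Convert a rendered count such as '1,234' or '12.3K' to an integer."""
    if not raw:
        return None
    match = _COUNT_RE.match(raw)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    return round(number * _SUFFIX[match.group(2).upper()])
```

Abbreviated counts lose precision, so a value parsed from "12.3K" should be stored as an approximation rather than an exact figure.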

6. Scrape a public profile header

Profile root pages (/handle) sometimes render bio, follower count, and following count for logged-out visitors. Recent tweets on the same page usually do not render — expect the tweet list to be empty or replaced by a login prompt. Treat profile scraping as a way to capture static metadata (bio, location, join date, follower count snapshot) rather than a feed of recent activity.

Wait for [data-testid="UserName"] to confirm the profile header has hydrated. The follower count anchor uses href ending in /verified_followers on some account types — verify the selector against the specific account type you are targeting, as the href pattern differs for some verified organisations.

Profile scrapes are lower frequency than tweet scrapes — run them on a schedule (e.g. daily per account) rather than on every mention event. Cache the result and only re-scrape when you need a fresh follower count snapshot.

X profile header — js_rendering with CSS extraction

json

{
  "url": "https://x.com/nasa",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-testid=\"UserName\"]",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "display_name": "[data-testid=\"UserName\"] span:first-child",
    "handle": "[data-testid=\"UserName\"] a",
    "bio": "[data-testid=\"UserDescription\"]",
    "followers": "a[href$=\"/verified_followers\"] span, a[href$=\"/followers\"] span",
    "following": "a[href$=\"/following\"] span",
    "location": "[data-testid=\"UserLocation\"] span",
    "join_date": "[data-testid=\"UserJoinDate\"] span",
    "website": "[data-testid=\"UserUrl\"] a",
    "verified_badge": "[data-testid=\"icon-verified\"]"
  }
}
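
The daily-per-account schedule mentioned above reduces to a freshness check against the last snapshot time. This is a minimal sketch; the function name and the one-day default are assumptions taken from the "daily per account" example, not a recommended interval.

```python
from datetime import datetime, timedelta
from typing import Optional


def needs_profile_refresh(last_captured_at: Optional[datetime],
                          now: datetime,
                          max_age: timedelta = timedelta(days=1)) -> bool:
    """True when no snapshot exists or the cached one is older than max_age."""
    if last_captured_at is None:
        return True
    return now - last_captured_at >= max_age
```

Pass timezone-aware datetimes for both arguments so that the subtraction is well defined across servers in different zones.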

7. Official X API vs HTML scraping: choosing the right approach

The X API v2 provides tweet lookup by ID, user lookup by handle or ID, search (recent and full-archive on paid tiers), filtered stream for real-time keyword monitoring, and timelines. For any production system that requires search, hashtag tracking, or timeline access, API licensing is the only reliable path. The engineering cost of maintaining HTML scrapers that break monthly will exceed API subscription costs for most teams within a year.

HTML scraping with OmniScrape is appropriate for specific, lower-frequency use cases: ad-hoc research on a set of known tweet IDs, one-off competitive analysis of public profile metadata, or prototyping before committing to API budget. It is not appropriate for real-time monitoring, keyword search, or hashtag tracking.

If you must automate flows that require a logged-in session (for accounts you own and operate), see the headless browser scraping guide for session management patterns. Be aware that automating login flows on accounts you do not own violates X's terms.

For pipelines that combine multiple social platforms, the social media web scraping guide covers cross-platform architecture patterns including rate limit management and data normalisation.

8. Deduplication: store tweet IDs, not URLs

The numeric status ID is the canonical primary key for any tweet. x.com and twitter.com both lead to the same content: a tweet at twitter.com/nasa/status/123 and x.com/nasa/status/123 are identical records. Normalise all URLs to a single handle-free form (for example x.com/i/web/status/{id}) before storage, or better, store only the numeric ID and reconstruct URLs on demand.

Account handle changes are common. The handle in a tweet permalink URL is cosmetic — x.com redirects any handle to the correct one as long as the status ID is valid. Do not use the handle as part of a composite primary key. Store handle separately as a snapshot value with a captured_at timestamp.

Deleted tweets return a soft 404: in many cases X renders a 'this tweet is unavailable' message rather than an HTTP 404 status. Detect deletion by checking for the absence of [data-testid="tweetText"] after a successful page load, but remember that a login modal produces the same empty result, so confirm on a later attempt before recording a deletion. Set a deleted_at timestamp on the record rather than removing it. Retaining deleted tweet metadata (ID, author, timestamp) is valuable for gap analysis in research datasets.
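
The storage rules above (ID as primary key, handle kept as a timestamped snapshot, deletions stamped rather than removed) can be sketched with SQLite. The table names, column names, and helper functions are hypothetical; only the deleted_at and captured_at names come from this guide.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   INTEGER PRIMARY KEY,   -- numeric status ID, never the URL
    author_id  TEXT,
    created_at TEXT,
    deleted_at TEXT                   -- NULL while the tweet is still live
);
CREATE TABLE IF NOT EXISTS handle_snapshots (
    tweet_id    INTEGER NOT NULL REFERENCES tweets(tweet_id),
    handle      TEXT NOT NULL,
    captured_at TEXT NOT NULL
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_tweet(conn: sqlite3.Connection, tweet_id: int,
                 handle: str, captured_at: str) -> None:
    """Insert the tweet once, and always record the handle seen at this time."""
    conn.execute("INSERT OR IGNORE INTO tweets (tweet_id) VALUES (?)", (tweet_id,))
    conn.execute(
        "INSERT INTO handle_snapshots (tweet_id, handle, captured_at) VALUES (?, ?, ?)",
        (tweet_id, handle, captured_at),
    )


def mark_deleted(conn: sqlite3.Connection, tweet_id: int, when: str) -> None:
    """Stamp deleted_at the first time deletion is observed; keep the row."""
    conn.execute(
        "UPDATE tweets SET deleted_at = ? WHERE tweet_id = ? AND deleted_at IS NULL",
        (when, tweet_id),
    )
```

Keeping handles in a separate table means a username change adds a snapshot row instead of creating a second tweet record.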

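For high-volume pipelines, implement a seen-IDs bloom filter or a Redis SET of processed tweet IDs before hitting your primary database. Tweet IDs are roughly time-ordered (the Snowflake format embeds a millisecond timestamp in the high bits), so you can use ID range comparisons to efficiently identify newer tweets without full table scans. IDs generated within the same instant are not strictly ordered, so treat the comparison as approximate at millisecond granularity.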
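
As a rough illustration of both ideas, the sketch below decodes the creation time from a Snowflake-format ID and filters a batch against an in-memory seen set. The function names are hypothetical, and the plain Python set stands in for the bloom filter or Redis SET; IDs issued before the Snowflake format was introduced in late 2010 do not encode a timestamp this way.

```python
from datetime import datetime, timezone
from typing import Iterable

# Snowflake epoch in milliseconds (2010-11-04 UTC).
TWITTER_EPOCH_MS = 1288834974657


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Decode the creation time stored in the high bits of a Snowflake ID."""
    # The low 22 bits hold worker and sequence numbers; shift them away.
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def filter_new(tweet_ids: Iterable[int], seen: set[int]) -> list[int]:
    """Return IDs not yet in `seen`, adding them to `seen` as a side effect."""
    fresh = []
    for tweet_id in tweet_ids:
        if tweet_id not in seen:
            seen.add(tweet_id)
            fresh.append(tweet_id)
    return fresh
```

Decoding the timestamp from the ID also gives an approximate created_at for tweets whose time element failed to render.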

9. X Terms of Service and legal considerations

X's Terms of Service explicitly restrict scraping and require API use for most forms of automated data collection. Section 4 of the Developer Agreement prohibits reverse engineering the platform and collecting data outside of official API access. These restrictions have been enforced — X has pursued legal action against large-scale scrapers.

Beyond X's own terms, storing public tweets at scale may implicate EU GDPR (tweets contain personal data), the EU DSA (platform data access obligations apply to researchers under specific conditions), and CCPA for California residents. Academic researchers in the EU may access X data under DSA Article 40 researcher access provisions — consult your institution's legal team.

OmniScrape provides the technical capability to fetch public web pages. It does not grant any rights to X's data, does not indemnify users against X's terms enforcement, and does not constitute legal advice. Engage your legal team before building any production system that stores X data at scale.

Frequently asked questions

Can I scrape X search results without logging in?

No, not reliably. Since mid-2023, X enforces a login wall on search results server-side — a residential proxy and headless browser will render the login modal, not search results. The only reliable way to access search is through the official X API v2 search endpoint on a paid tier. For monitoring specific accounts, scraping individual tweet permalinks by known status ID is a more viable logged-out approach.

Why does my X scrape return a login modal instead of tweet content?

X is serving the login wall for that URL and IP combination. Check three things: (1) confirm you are using a residential proxy, since datacenter IPs are blocked more aggressively; (2) confirm you are targeting an individual status permalink (/status/{id}), not a search or timeline URL; (3) confirm js_rendering mode is set, since tweet content requires JavaScript execution. Even with all three correct, success is not guaranteed because X's logged-out rendering is intentionally inconsistent. If body.data.css_extracted returns an empty value for the text field (the [data-testid="tweetText"] selector), treat it as a login-wall hit and back off before retrying.

Is Nitter a viable alternative to direct X.com scraping?

No. X rate-limited Nitter instances into failure in early 2024 by restricting the guest token API that Nitter relied on. The few remaining public instances are unreliable and serve stale or partial data. Do not build production pipelines on Nitter. Self-hosted Nitter instances face the same guest token restrictions.

How do I track hashtags or keywords at scale?

Use the X API v2 filtered stream for real-time keyword and hashtag monitoring. It is a paid-tier feature, and the tier that includes it has changed over time, so check developer.x.com before budgeting. For historical search, the full-archive search endpoint is available on Pro and Enterprise tiers. HTML scraping of hashtag pages is not viable: they are login-gated, and the volume of updates makes polling impractical even if you could access them.

What is the X API pricing structure for search access?

X API pricing changes frequently, so check developer.x.com for current rates. At the time of writing, Basic tier provides limited monthly tweet reads and recent search access; Pro tier provides higher volume and full-archive search; Enterprise tier provides firehose and custom volume. OmniScrape does not resell X API access; this guide covers HTML scraping only.

How do I handle tweet deletions in my dataset?

Deleted tweets typically return a page that renders 'this tweet is unavailable' rather than an HTTP 404. Detect deletion by checking for the absence of [data-testid="tweetText"] after a confirmed page load (body.success true, page loaded, but no tweet text extracted). Mark the record with a deleted_at timestamp in your database. Do not delete the row — retaining the tweet ID, author ID, and original timestamp is valuable for dataset integrity and gap analysis.

Can I scrape follower lists or following lists from profiles?

No. Follower and following list pages (/followers, /following) require login and are not accessible to logged-out scrapers. You can capture a snapshot of the follower and following counts from the profile header page (which sometimes renders logged-out), but not the individual accounts in those lists. For follower graph data, the X API v2 followers/following lookup endpoints are the only viable option.

