OmniScrape

X (Twitter) Scraper: Tweets, Profiles, and Hashtags

X (formerly Twitter) restructured data access aggressively starting in 2023: the free API tier was eliminated, login walls appeared across search, timelines, and most profile views, and third-party Nitter instances were shut down through rate-limit enforcement. Teams that previously relied on the free v2 API or open Nitter proxies now face a narrower set of options — paid X API tiers, licensed firehose access, or HTML scraping that demands ongoing maintenance every time the React frontend ships a structural change.

This guide documents what remains accessible on public X.com HTML without authentication, the precise selectors that work today, and the architectural trade-offs between API access and HTML scraping. It is written for engineers building social monitoring pipelines, brand-mention trackers, and research tooling — not for casual use. For broader social monitoring architecture, see social media web scraping. For handling heavily protected pages in general, see web scraping without getting blocked.

On this page

1. Data fields social monitoring products try to capture
2. X.com URL patterns worth knowing
3. X HTML structure and embedded JSON state
4. Login walls, rate limits, and anti-bot measures
5. Scrape a public tweet permalink
6. Scrape a public profile header
7. Official X API vs HTML scraping: choosing the right approach
8. Deduplication: store tweet IDs, not URLs
9. X Terms of Service and legal considerations
10. FAQ

1. Data fields social monitoring products try to capture

The exact fields your pipeline needs determine whether HTML scraping is viable at all. Crisis communications teams care about mention velocity and reach: they need follower counts and repost rates in near real-time. Academic researchers studying public discourse want full tweet text, timestamps, and language tags for corpus analysis. Influencer marketing platforms need engagement rate calculations across sponsored posts, which requires like, reply, repost, and view counts together.

Below is the full set of fields that production monitoring systems typically require. Fields marked as difficult are either login-gated, absent from logged-out HTML, or only available via the official API.

  • Tweet ID (numeric status ID from URL — always available on permalink pages)
  • Tweet text and language code (available logged-out on individual status URLs, intermittently)
  • created_at timestamp (rendered as a <time> element with datetime attribute)
  • Like, repost, reply, and view counts (data-testid buttons; view counts often absent logged-out)
  • Author handle, display name, and avatar URL
  • Follower and following counts (profile header, sometimes rendered logged-out)
  • Hashtags, cashtags, and outbound link URLs (parsed from tweet text or anchor tags)
  • Quote tweet and reply parent references (parent tweet URL in thread context)
  • Media URLs — image src and video poster thumbnail (present in DOM on some logged-out views)
  • Verified badge status and account type (blue, gold, grey checkmark)
  • Profile bio, location string, and join date (profile header)
  • Pinned tweet reference
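
One way to reflect this list in a pipeline is a typed record in which every login-gated or intermittent field is optional, so a missing value is never confused with a real zero. The sketch below is illustrative only: `TweetRecord` and `parse_count` are hypothetical names, and the abbreviated count format (`12.3K`, `1.2M`) is an assumption about how the rendered page displays counts.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: Optional[str]) -> Optional[int]:
    """Convert displayed text such as '1,204' or '12.3K' to an int.

    Returns None for missing or unrecognised text so that "absent
    logged-out" stays distinguishable from a genuine zero.
    """
    if not text:
        return None
    match = re.fullmatch(r"([\d.,]+)\s*([KMB])?", text.strip(), flags=re.IGNORECASE)
    if not match:
        return None
    number, suffix = match.groups()
    try:
        value = float(number.replace(",", ""))
    except ValueError:
        return None
    return int(round(value * _SUFFIXES.get((suffix or "").upper(), 1)))


@dataclass
class TweetRecord:
    tweet_id: int                       # always recoverable from the permalink URL
    text: Optional[str] = None          # intermittent when logged out
    lang: Optional[str] = None
    created_at: Optional[str] = None    # ISO 8601 string from <time datetime="...">
    author_handle: Optional[str] = None
    author_name: Optional[str] = None
    likes: Optional[int] = None
    reposts: Optional[int] = None
    replies: Optional[int] = None
    views: Optional[int] = None         # often absent logged-out
    hashtags: list[str] = field(default_factory=list)
    media_urls: list[str] = field(default_factory=list)
```

Only `tweet_id` is required, which matches the one field the list marks as always available.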

2. X.com URL patterns worth knowing

Tweet permalink URLs are stable as long as the numeric status ID exists. The handle in the URL is cosmetic — X redirects any handle to the correct one if the ID is valid, which matters when accounts change usernames. Search, hashtag, and list URLs are almost universally login-gated for logged-out visitors as of 2024, making them poor targets for HTML scraping at scale.

Note that x.com and twitter.com are different hosts for the same content: twitter.com URLs redirect to x.com. Normalise all stored URLs to x.com/status/{id} to avoid duplicates in your database.

  • Tweet permalink: https://x.com/{handle}/status/{tweet_id} — most reliable logged-out target
  • Profile root: https://x.com/{handle} — bio and follower counts sometimes render logged-out
  • Profile media tab: https://x.com/{handle}/media — usually login-gated
  • Search: https://x.com/search?q={query}&src=typed_query — login wall in most regions
  • Hashtag: https://x.com/hashtag/{tag} — login-gated since mid-2023
  • List: https://x.com/i/lists/{list_id} — requires login
  • Moments / Events: https://x.com/i/events/{id} — inconsistent logged-out rendering
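
The normalisation described above can be sketched with the standard library. This is a minimal illustration, not a complete URL parser: the function names are hypothetical, and treating `www.` and `mobile.` as equivalent host prefixes is an assumption.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Matches /{handle}/status/{id}, /status/{id}, and deeper paths such as
# /{handle}/status/{id}/photo/1. The handle is ignored because it is cosmetic.
_STATUS_PATH = re.compile(r"^/(?:[^/]+/)*?status/(\d+)(?:/|$)")
_X_HOSTS = {"x.com", "twitter.com"}


def tweet_id_from_url(url: str) -> Optional[int]:
    """Return the numeric status ID from an x.com or twitter.com URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    for prefix in ("www.", "mobile."):   # assumed equivalent prefixes
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host not in _X_HOSTS:
        return None
    match = _STATUS_PATH.match(parsed.path)
    return int(match.group(1)) if match else None


def canonical_url(tweet_id: int) -> str:
    """Build the storage form this guide recommends: x.com/status/{id}."""
    return f"https://x.com/status/{tweet_id}"
```

For example, `tweet_id_from_url("https://twitter.com/nasa/status/123?s=20")` and `tweet_id_from_url("https://x.com/NASA/status/123/photo/1")` both yield `123`, so the two URLs collapse to one record.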

3. X HTML structure and embedded JSON state

X's web client is a React single-page application. The server renders an initial HTML payload that sometimes includes tweet content for logged-out users on individual status URLs, but the React hydration layer controls what actually renders in the browser. As a result, a plain HTTP fetch (no JavaScript execution) will often return an HTML shell with no tweet content, and you need a headless browser to get the fully rendered DOM.

When tweet content does render, the key data-testid attributes are: article[data-testid="tweet"] as the root tweet container, div[data-testid="tweetText"] for the tweet body, time[datetime] for the ISO 8601 timestamp, and the engagement buttons — [data-testid="reply"], [data-testid="retweet"], [data-testid="like"]. View counts appear in an anchor linking to /analytics. These testid values have been relatively stable but are not guaranteed — X engineers do rename them.

X may also embed a JSON state blob in a <script id="__NEXT_DATA__"> tag. When present, this blob can contain structured tweet entity data including full text, entities (urls, hashtags, media), and author objects. Its schema is undocumented and changes without notice, so confirm the script id and key paths against a live page before relying on them. Parsing it is fragile but can yield cleaner structured data than CSS extraction. Always fall back to CSS selectors when the blob is absent or restructured.
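
A defensive way to read such a blob is to return `None` whenever the script tag is missing or its contents do not parse, so the caller can fall back to CSS selectors. The sketch below uses only the standard library. The helper names are hypothetical, and because the blob's schema is undocumented, any key path passed to `dig` is an assumption to check against real responses.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class _ScriptJsonParser(HTMLParser):
    """Collects the text content of <script id="..."> for one given id."""

    def __init__(self, script_id: str) -> None:
        super().__init__()
        self._script_id = script_id
        self._capturing = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self._script_id:
            self._capturing = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.chunks.append(data)


def extract_embedded_json(html: str, script_id: str = "__NEXT_DATA__") -> Optional[Any]:
    """Return the parsed JSON blob, or None if it is absent or malformed."""
    parser = _ScriptJsonParser(script_id)
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def dig(obj: Any, *path: Any) -> Any:
    """Walk nested dicts and lists, returning None instead of raising on a miss."""
    for key in path:
        if isinstance(obj, dict) and key in obj:
            obj = obj[key]
        elif isinstance(obj, list) and isinstance(key, int) and -len(obj) <= key < len(obj):
            obj = obj[key]
        else:
            return None
    return obj
```

A `None` from either helper is the signal to use the CSS-extracted values instead, which keeps a schema change from turning into an exception in the pipeline.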

Tweet ID is always recoverable from the URL path — /status/NUMERIC — regardless of DOM structure. Treat the URL as the source of truth for the ID.

4. Login walls, rate limits, and anti-bot measures

X's access restrictions are intentional product decisions, not incidental bot defences. Since 2023, most timeline, search, hashtag, and list views redirect logged-out visitors to a login modal before any tweet content renders. Individual status permalinks are the last remaining surface that sometimes serves content to logged-out users, and even these are inconsistent — X has been progressively tightening this path.

At the infrastructure level, datacenter IP ranges are blocked aggressively. Residential proxies improve success rates on individual permalink pages, but are not a reliable solution for search or hashtag pages where the login wall is enforced server-side. Rate limits apply per IP and per session. Frequent React component restructuring means CSS selectors that work today may break within weeks of a frontend deploy.

Budget realistically: if you are building a production system on X HTML scraping, allocate engineering time for weekly selector audits and expect periods of zero data when X ships breaking changes. For anything requiring search, timelines, or hashtag tracking at scale, the official API is the only sustainable path.

  • Login modal on search, hashtag pages, lists, and most timeline views
  • Server-side login enforcement — residential proxies do not bypass it on gated routes
  • Aggressive blocking of datacenter IP ranges on all routes
  • Rate limiting per IP and per browser session fingerprint
  • Frequent React component restructuring that breaks CSS selectors
  • Age-restricted and geo-withheld content varies by proxy exit region
  • X Terms of Service restrict automated collection — legal review required before production use
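
Selector audits can be partly automated by tracking how often each extracted field comes back non-empty across a recent batch of scrapes. The sketch below is one possible approach, not a feature of any particular tool: the function names and the 0.5 threshold are hypothetical, and it assumes each result is a flat mapping of field name to extracted string.

```python
from typing import Iterable, Mapping, Optional


def selector_fill_rates(
    results: Iterable[Mapping[str, Optional[str]]],
    fields: Iterable[str],
) -> dict[str, float]:
    """Return, per field, the fraction of results with a non-empty value."""
    fields = list(fields)
    filled = {name: 0 for name in fields}
    total = 0
    for row in results:
        total += 1
        for name in fields:
            value = row.get(name)
            if value is not None and str(value).strip():
                filled[name] += 1
    if total == 0:
        return {name: 0.0 for name in fields}
    return {name: filled[name] / total for name in fields}


def broken_selectors(
    rates: Mapping[str, float],
    threshold: float = 0.5,          # hypothetical alert threshold
    optional: Iterable[str] = (),    # fields expected to be sparse, e.g. views
) -> list[str]:
    """List fields whose fill rate dropped below the threshold."""
    skip = set(optional)
    return sorted(name for name, rate in rates.items() if rate < threshold and name not in skip)
```

If every field drops at once, a login wall or IP block is the more likely cause. If a single field drops while the others hold, a renamed `data-testid` is the first thing to check.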

5. Scrape a public tweet permalink

Individual status URLs are the most viable HTML scraping target on X. Use mode js_rendering because the tweet content is rendered by React in the browser — a plain HTTP fetch returns an empty shell. Set js_wait_selector to [data-testid="tweetText"] so the request waits until the tweet body is present in the DOM before extracting. If the selector never appears, X has served a login modal instead of content.

Use a residential US proxy. X's logged-out rendering is inconsistent across regions and more reliable on US exit nodes. Set js_wait_timeout to at least 12 seconds — X's JS bundle is large and hydration is slow.

The response data arrives in body.data.css_extracted when output_format is css_extractor. Check body.success before processing. A successful extraction with empty text fields usually means a login modal rendered — treat it as a soft failure and do not retry the same URL immediately.

X tweet permalink — js_rendering with CSS extraction
```json
{
  "url": "https://x.com/OpenAI/status/1234567890123456789",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-testid=\"tweetText\"]",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "text": "[data-testid=\"tweetText\"]",
    "author_name": "[data-testid=\"User-Name\"] span:first-child",
    "author_handle": "[data-testid=\"User-Name\"] a",
    "timestamp": "time[datetime]",
    "replies": "[data-testid=\"reply\"] span[data-testid=\"app-text-transition-container\"]",
    "retweets": "[data-testid=\"retweet\"] span[data-testid=\"app-text-transition-container\"]",
    "likes": "[data-testid=\"like\"] span[data-testid=\"app-text-transition-container\"]",
    "views": "a[href$=\"/analytics\"] span",
    "media_img": "[data-testid=\"tweetPhoto\"] img"
  }
}
```
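
The response handling described in this section can be sketched as a small classifier plus a retry delay. This is an illustration under stated assumptions: the function names are hypothetical, the `text` key matches the `css_selectors` request above, and whether extracted values arrive as strings or lists of strings is not specified here, so the code accepts both. The backoff constants are placeholders.

```python
import random
from typing import Any, Mapping


def classify_response(body: Mapping[str, Any]) -> str:
    """Return 'ok', 'soft_failure', or 'error' for one parsed response body."""
    if not body.get("success"):
        return "error"
    data = body.get("data") or {}
    extracted = data.get("css_extracted") or {}
    text = extracted.get("text")
    if isinstance(text, list):              # assumed: a selector may yield a list
        text = " ".join(str(item) for item in text)
    if not text or not str(text).strip():
        # Successful fetch but no tweet body: usually a login modal.
        return "soft_failure"
    return "ok"


def retry_delay_seconds(attempt: int, base: float = 60.0, cap: float = 3600.0) -> float:
    """Exponential backoff with jitter that never returns less than half the ceiling.

    attempt is 0 for the first retry. The floor keeps a soft failure from
    being retried immediately.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling / 2 + random.uniform(0, ceiling / 2)
```

Keeping `soft_failure` separate from `error` lets the scheduler back off on login-wall hits without counting them as transport failures.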

6. Scrape a public profile header

Profile root pages (/handle) sometimes render bio, follower count, and following count for logged-out visitors. Recent tweets on the same page usually do not render — expect the tweet list to be empty or replaced by a login prompt. Treat profile scraping as a way to capture static metadata (bio, location, join date, follower count snapshot) rather than a feed of recent activity.

Wait for [data-testid="UserName"] to confirm the profile header has hydrated. The follower count anchor uses href ending in /verified_followers on some account types — verify the selector against the specific account type you are targeting, as the href pattern differs for some verified organisations.

Profile scrapes are lower frequency than tweet scrapes — run them on a schedule (e.g. daily per account) rather than on every mention event. Cache the result and only re-scrape when you need a fresh follower count snapshot.

X profile header — js_rendering with CSS extraction
```json
{
  "url": "https://x.com/nasa",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-testid=\"UserName\"]",
  "js_wait_timeout": 12000,
  "css_selectors": {
    "display_name": "[data-testid=\"UserName\"] span:first-child",
    "handle": "[data-testid=\"UserName\"] a",
    "bio": "[data-testid=\"UserDescription\"]",
    "followers": "a[href$=\"/verified_followers\"] span, a[href$=\"/followers\"] span",
    "following": "a[href$=\"/following\"] span",
    "location": "[data-testid=\"UserLocation\"] span",
    "join_date": "[data-testid=\"UserJoinDate\"] span",
    "website": "[data-testid=\"UserUrl\"] a",
    "verified_badge": "[data-testid=\"icon-verified\"]"
  }
}
```
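
The scheduling advice in this section (scrape profiles daily, cache the result, re-scrape only when the snapshot is stale) could look roughly like the following. `ProfileSnapshotCache` is a hypothetical in-memory helper written for illustration. A real pipeline would more likely keep `captured_at` in its database.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


class ProfileSnapshotCache:
    """Keeps the latest profile snapshot per handle and reports when it is stale."""

    def __init__(self, max_age: timedelta = timedelta(days=1)) -> None:
        self._max_age = max_age
        self._snapshots: dict[str, tuple[datetime, dict]] = {}

    def is_due(self, handle: str, now: Optional[datetime] = None) -> bool:
        """True if the handle has no snapshot or the snapshot is older than max_age."""
        now = now or datetime.now(timezone.utc)
        entry = self._snapshots.get(handle.lower())
        return entry is None or now - entry[0] >= self._max_age

    def store(self, handle: str, snapshot: dict, captured_at: Optional[datetime] = None) -> None:
        """Save a snapshot together with the time it was captured."""
        stamp = captured_at or datetime.now(timezone.utc)
        self._snapshots[handle.lower()] = (stamp, snapshot)

    def get(self, handle: str) -> Optional[dict]:
        """Return the cached snapshot, or None if the handle was never scraped."""
        entry = self._snapshots.get(handle.lower())
        return entry[1] if entry else None
```

A mention handler would call `is_due(handle)` and enqueue a profile scrape only when it returns `True`, serving every other lookup from `get(handle)`.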

7. Official X API vs HTML scraping: choosing the right approach

The X API v2 provides tweet lookup by ID, user lookup by handle or ID, search (recent and full-archive on paid tiers), filtered stream for real-time keyword monitoring, and timelines. For any production system that requires search, hashtag tracking, or timeline access, API licensing is the only reliable path. The engineering cost of maintaining HTML scrapers that break monthly will exceed API subscription costs for most teams within a year.

HTML scraping with OmniScrape is appropriate for specific, lower-frequency use cases: ad-hoc research on a set of known tweet IDs, one-off competitive analysis of public profile metadata, or prototyping before committing to API budget. It is not appropriate for real-time monitoring, keyword search, or hashtag tracking.

If you must automate flows that require a logged-in session (for accounts you own and operate), read headless browser scraping for session management patterns. Be aware that automating login flows on accounts you do not own violates X's terms.

For pipelines that combine multiple social platforms, social media web scraping covers cross-platform architecture patterns including rate limit management and data normalisation.

8. Deduplication: store tweet IDs, not URLs

The numeric status ID is the canonical primary key for any tweet. x.com and twitter.com both serve the same content — a tweet at twitter.com/nasa/status/123 and x.com/nasa/status/123 are identical records. Normalise all URLs to x.com/status/{id} before storage, or better, store only the numeric ID and reconstruct URLs on demand.

Account handle changes are common. The handle in a tweet permalink URL is cosmetic — x.com redirects any handle to the correct one as long as the status ID is valid. Do not use the handle as part of a composite primary key. Store handle separately as a snapshot value with a captured_at timestamp.

Deleted tweets return a soft 404: in many cases X renders a 'this tweet is unavailable' message rather than an HTTP 404 status. Because a missing [data-testid="tweetText"] can also mean a login modal rendered, treat a tweet as deleted only when the page loads successfully, the tweet text is absent, and the unavailable message is present. Set a deleted_at timestamp on the record rather than removing it. Retaining deleted tweet metadata (ID, author, timestamp) is valuable for gap analysis in research datasets.
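
The storage rules in this section (numeric ID as the primary key, handle stored as a timestamped snapshot, deletions recorded rather than removed) can be sketched with SQLite. The table and function names are hypothetical and the schema is deliberately minimal.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id   INTEGER PRIMARY KEY,   -- numeric status ID, never the URL
    author_id  TEXT,
    created_at TEXT,
    deleted_at TEXT                   -- NULL while the tweet is still live
);
CREATE TABLE IF NOT EXISTS handle_snapshots (
    tweet_id    INTEGER NOT NULL REFERENCES tweets(tweet_id),
    handle      TEXT NOT NULL,        -- cosmetic; may change over time
    captured_at TEXT NOT NULL
);
"""


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_sighting(conn: sqlite3.Connection, tweet_id: int, handle: str,
                    created_at: Optional[str] = None) -> bool:
    """Insert the tweet if unseen and append a handle snapshot. True if new."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO tweets (tweet_id, created_at) VALUES (?, ?)",
            (tweet_id, created_at),
        )
        is_new = cur.rowcount == 1
        conn.execute(
            "INSERT INTO handle_snapshots (tweet_id, handle, captured_at) VALUES (?, ?, ?)",
            (tweet_id, handle, _now()),
        )
    return is_new


def mark_deleted(conn: sqlite3.Connection, tweet_id: int) -> None:
    """Set deleted_at once; the row and its metadata are kept."""
    with conn:
        conn.execute(
            "UPDATE tweets SET deleted_at = COALESCE(deleted_at, ?) WHERE tweet_id = ?",
            (_now(), tweet_id),
        )
```

Because the handle lives in its own table, a username change adds a snapshot row instead of creating a second tweet record.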

For high-volume pipelines, implement a seen-IDs bloom filter or a Redis SET of processed tweet IDs before hitting your primary database. Tweet IDs use the Snowflake format, which stores a millisecond timestamp in the high bits, so IDs are roughly time-ordered (IDs minted in the same millisecond are not strictly ordered). You can use ID range comparisons to efficiently identify new tweets without full table scans.
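
The ID range comparison can be made concrete by converting between Snowflake IDs and timestamps. The sketch below uses the publicly documented Snowflake layout (timestamp in the bits above the low 22, offset from a 2010 epoch) and applies only to Snowflake-era IDs. `SeenIds` is a hypothetical exact in-memory stand-in for the Redis SET or bloom filter mentioned above.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657   # Snowflake epoch, 2010-11-04 UTC
TIMESTAMP_SHIFT = 22               # low 22 bits hold worker and sequence numbers


def snowflake_to_datetime(tweet_id: int) -> datetime:
    """Recover the creation time encoded in a Snowflake tweet ID."""
    ms = (tweet_id >> TIMESTAMP_SHIFT) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def min_id_for(moment: datetime) -> int:
    """Smallest Snowflake ID that could have been created at or after moment."""
    ms = int(moment.timestamp() * 1000) - TWITTER_EPOCH_MS
    return max(ms, 0) << TIMESTAMP_SHIFT


class SeenIds:
    """Exact in-memory de-duplication set (stand-in for Redis or a bloom filter)."""

    def __init__(self) -> None:
        self._seen: set[int] = set()

    def add_if_new(self, tweet_id: int) -> bool:
        """Return True the first time an ID is offered, False afterwards."""
        if tweet_id in self._seen:
            return False
        self._seen.add(tweet_id)
        return True
```

A query such as `WHERE tweet_id >= ?` with `min_id_for(cutoff)` as the parameter then selects everything created since `cutoff` using only the primary-key index.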

9. X Terms of Service and legal considerations

X's Terms of Service explicitly restrict scraping and require API use for most forms of automated data collection. Section 4 of the Developer Agreement prohibits reverse engineering the platform and collecting data outside of official API access. These restrictions have been enforced — X has pursued legal action against large-scale scrapers.

Beyond X's own terms, storing public tweets at scale may implicate EU GDPR (tweets contain personal data), the EU DSA (platform data access obligations apply to researchers under specific conditions), and CCPA for California residents. Academic researchers in the EU may access X data under DSA Article 40 researcher access provisions — consult your institution's legal team.

OmniScrape provides the technical capability to fetch public web pages. It does not grant any rights to X's data, does not indemnify users against X's terms enforcement, and does not constitute legal advice. Engage your legal team before building any production system that stores X data at scale.

10. Frequently asked questions

Can I scrape X search results without logging in?

No, not reliably. Since mid-2023, X enforces a login wall on search results server-side — a residential proxy and headless browser will render the login modal, not search results. The only reliable way to access search is through the official X API v2 search endpoint on a paid tier. For monitoring specific accounts, scraping individual tweet permalinks by known status ID is a more viable logged-out approach.

Why does my X scrape return a login modal instead of tweet content?

X is serving the login wall for that URL and IP combination. Check three things: (1) confirm you are using a residential proxy, since datacenter IPs are blocked more aggressively; (2) confirm you are targeting an individual status permalink (/status/{id}), not a search or timeline URL; (3) confirm js_rendering mode is set, since tweet content requires JavaScript execution. Even with all three correct, success is not guaranteed: X's logged-out rendering is intentionally inconsistent. If body.data.css_extracted returns an empty value for the text field (the [data-testid="tweetText"] selector), treat it as a login-wall hit and back off before retrying.

Is Nitter a viable alternative to direct X.com scraping?

No. X rate-limited Nitter instances into failure in early 2024 by restricting the guest token API that Nitter relied on. The few remaining public instances are unreliable and serve stale or partial data. Do not build production pipelines on Nitter. Self-hosted Nitter instances face the same guest token restrictions.

How do I track hashtags or keywords at scale?

Use the X API v2 filtered stream for real-time keyword and hashtag monitoring. It is limited to paid tiers, and the tier that includes it has changed over time, so check developer.x.com before budgeting. For historical search, the full-archive search endpoint is available on Pro and Enterprise tiers. Scraping hashtag pages as HTML is not viable: they are login-gated, and the volume of updates makes polling impractical even if you could access them.

What is the X API pricing structure for search access?

X API pricing changes frequently, so check developer.x.com for current rates. At the time of writing, Basic tier provides limited monthly tweet reads and recent search access; Pro tier provides higher volume and full-archive search; Enterprise tier provides firehose and custom volume. OmniScrape does not resell X API access. This guide covers HTML scraping only.

How do I handle tweet deletions in my dataset?

Deleted tweets typically return a page that renders 'this tweet is unavailable' rather than an HTTP 404. Detect deletion after a confirmed page load (body.success true, page loaded) by checking that no tweet text was extracted and that the unavailable message is present. Missing tweet text alone is not enough, because a login modal produces the same empty result. Mark the record with a deleted_at timestamp in your database. Do not delete the row: retaining the tweet ID, author ID, and original timestamp is valuable for dataset integrity and gap analysis.

Can I scrape follower lists or following lists from profiles?

No. Follower and following list pages (/followers, /following) require login and are not accessible to logged-out scrapers. You can capture a snapshot of the follower and following counts from the profile header page (which sometimes renders logged-out), but not the individual accounts in those lists. For follower graph data, the X API v2 followers/following lookup endpoints are the only viable option.

Related guides

  • Social Media Web Scraping: Brand Mention Monitoring from Public Pages
  • Sentiment Analysis Web Scraping: Build a Production Review Pipeline
  • Headless Browser Scraping: When to Use It and How to Do It Right
  • Web Scraping Without Getting Blocked
