TikTok Scraper: Extract Videos, Hashtags, and Trend Data

1. TikTok data fields worth extracting

The most valuable signals are engagement ratios (likes-to-plays, comments-to-plays) and velocity — how fast counts are climbing relative to video age. Raw counts without timestamps are far less useful for trend detection.
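As a rough illustration of those two signals, here is a minimal Python sketch. The `VideoSnapshot` class and its field names are illustrative assumptions, not part of any TikTok or OmniScrape schema; velocity is approximated as average plays per hour since the video was posted.

```python
from dataclasses import dataclass


@dataclass
class VideoSnapshot:
    """One observation of a video's counters (field names are illustrative)."""
    play_count: int
    like_count: int
    comment_count: int
    create_time: int   # Unix epoch seconds when the video was posted
    observed_at: int   # Unix epoch seconds when the counts were scraped


def engagement_ratios(s: VideoSnapshot) -> dict:
    """Likes-to-plays and comments-to-plays; 0.0 when there are no plays yet."""
    if s.play_count <= 0:
        return {"likes_per_play": 0.0, "comments_per_play": 0.0}
    return {
        "likes_per_play": s.like_count / s.play_count,
        "comments_per_play": s.comment_count / s.play_count,
    }


def plays_per_hour(s: VideoSnapshot) -> float:
    """Average velocity since posting: plays divided by video age in hours."""
    # Clamp the age to one minute so a just-posted video cannot divide by zero.
    age_hours = max((s.observed_at - s.create_time) / 3600, 1 / 60)
    return s.play_count / age_hours
```

A video observed with 10,000 plays five hours after posting averages 2,000 plays per hour. Comparing two snapshots of the same video taken at different times would give a sharper velocity estimate than this lifetime average.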

Below is the full set of fields available from TikTok web pages, either from embedded JSON or rendered DOM. Not all fields are present on every page type — hashtag pages omit per-video durations, for example.

Video ID, canonical URL, and short-link redirect target
Video description text, inline hashtags, and @mentions
Play count, like count, comment count, share count, and collect count
Video create timestamp and duration in seconds
Music/sound ID, sound title, and original author flag
Creator username, display name, follower count, and verified status
Hashtag challenge aggregate view count and video count
Effect and sticker metadata when attached to a video
Region code and content language signals from video metadata
Cover image URL and dynamic cover (animated thumbnail) URL

2. TikTok web URL patterns

TikTok canonicalizes video URLs to the `/@username/video/VIDEO_ID` format. Short links (`vm.tiktok.com`) redirect to the canonical form — OmniScrape follows redirects automatically, so you can pass either form. The final URL after redirect is returned in the response metadata.

Hashtag pages and sound pages are the two other high-value targets. Search pages (`/search?q=`) render results but paginate client-side, making them harder to scrape at volume without session continuity.

Video (canonical): https://www.tiktok.com/@charlidamelio/video/7234567890123456789
Short link (auto-followed): https://vm.tiktok.com/ZMhABC123/
Hashtag challenge: https://www.tiktok.com/tag/fyp
Sound page: https://www.tiktok.com/music/original-sound-7234567890123456789
Creator profile: https://www.tiktok.com/@nike
Search results: https://www.tiktok.com/search?q=scraping
Discover (trending): https://www.tiktok.com/explore
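
If a pipeline has to route these URL shapes to different extractors, a small classifier can help. The sketch below is an assumption about how one might do it with the Python standard library; the function name and the page-type labels are hypothetical, and the patterns only cover the URL shapes listed above.

```python
import re
from urllib.parse import urlparse

# Order matters: a video path also starts with "/@username",
# so the video pattern must be tried before the profile pattern.
_PATTERNS = [
    ("video", re.compile(r"^/@[^/]+/video/(?P<id>\d+)/?$")),
    ("hashtag", re.compile(r"^/tag/(?P<id>[^/]+)/?$")),
    ("sound", re.compile(r"^/music/(?P<id>[^/]+)/?$")),
    ("profile", re.compile(r"^/@(?P<id>[^/]+)/?$")),
    ("search", re.compile(r"^/search/?$")),
    ("explore", re.compile(r"^/explore/?$")),
]


def classify_tiktok_url(url: str) -> tuple:
    """Return (page_type, identifier_or_None) for a TikTok web URL."""
    parsed = urlparse(url)
    if parsed.netloc == "vm.tiktok.com":
        # Short links carry an opaque code, not a video ID.
        return ("short_link", parsed.path.strip("/") or None)
    for page_type, pattern in _PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            return (page_type, match.groupdict().get("id"))
    return ("unknown", None)
```

Short links return their opaque code rather than a video ID, since the real ID is only known after the redirect resolves.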

3. SIGI_STATE and __UNIVERSAL_DATA_FOR_REHYDRATION__

TikTok's web app is a React SPA that bootstraps from a server-rendered JSON payload embedded in a `<script>` tag. Older builds used `<script id="SIGI_STATE" type="application/json">`. Current builds use `<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">`. The payload structure differs: the older payload exposes an `ItemModule` map keyed by video ID at the top level, while the newer one nests the video object under `__DEFAULT_SCOPE__`. In both, the video object carries nested `stats`, `author`, `music`, and `video` objects.

When this JSON is present, it is far more reliable than CSS extraction: counts are exact integers, timestamps are Unix epoch, and music metadata is structured. Parse the script tag content with a regex or an HTML parser, then `JSON.parse()` the inner text. Navigate to `__DEFAULT_SCOPE__['webapp.video-detail']['itemInfo']['itemStruct']` (newer schema) or `ItemModule[videoId]` (older schema).
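
The regex route can be sketched in Python with only the standard library. This is a minimal illustration, not a hardened parser: the function name is hypothetical, and it assumes the `id` attribute is double-quoted as in the tags shown above.

```python
import json
import re

# Newer tag first, then the older one.
SCRIPT_IDS = ("__UNIVERSAL_DATA_FOR_REHYDRATION__", "SIGI_STATE")


def extract_embedded_json(html: str):
    """Return (script_id, payload) for the first known script tag, else (None, None)."""
    for script_id in SCRIPT_IDS:
        pattern = re.compile(
            r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
            re.DOTALL,
        )
        match = pattern.search(html)
        if not match:
            continue
        try:
            return script_id, json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # tag present, but its content is not valid JSON
    return None, None
```

Returning the script ID alongside the payload tells the caller which schema to expect when walking the field paths.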

When the script tag exists but its content is an empty object (`{}`), or the tag is absent entirely, the most likely cause is that ByteDance's bot scoring has flagged the request. The usual fix is to escalate to `js_rendering` mode with a residential proxy in the target region. The headless browser executes TikTok's JavaScript, which populates the script tag after client-side hydration.
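
That decision can be expressed as a small check. In the sketch below, both function names are hypothetical; the `mode` and `proxy` keys and the `residential:<region>` value format follow the request examples in this guide.

```python
def needs_escalation(payload) -> bool:
    """True when the embedded JSON is missing (None) or an empty object."""
    return not isinstance(payload, dict) or not payload


def escalate_request(current: dict, target_region: str) -> dict:
    """Return a copy of the request switched to js_rendering with a residential proxy."""
    escalated = dict(current)  # copy so the caller's original request is untouched
    escalated["mode"] = "js_rendering"
    escalated["proxy"] = f"residential:{target_region}"
    return escalated
```

Copying the request instead of mutating it keeps the original around for logging which configuration produced the degraded response.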

4. Visible DOM selectors as fallback

When embedded JSON is unavailable or you want to validate counts against rendered output, TikTok's `data-e2e` attributes are the most stable CSS hooks. The platform uses these internally for end-to-end test automation, so they change less frequently than class names or element hierarchy.

Key selectors on a video detail page: `strong[data-e2e="like-count"]`, `strong[data-e2e="comment-count"]`, `strong[data-e2e="share-count"]`, `strong[data-e2e="undefined-count"]` (collect/bookmark count), `div[data-e2e="browse-video-desc"]` for the description, and `span[data-e2e="browse-username"]` for the creator handle.

On hashtag pages, `h1[data-e2e="challenge-title"]` holds the hashtag name and `strong[data-e2e="challenge-view-count"]` holds the aggregate view count. Individual video cards in the grid use `div[data-e2e="challenge-item"]` as container elements.

These selectors break when TikTok ships a redesign — treat them as likely-stable rather than guaranteed. Cross-check against the embedded JSON whenever possible.
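
One way to run that cross-check is to parse the rendered text into a number and compare it with the exact integer from the JSON. The sketch below assumes rendered counts are abbreviated with K/M/B suffixes (for example `45.3K`), which is an assumption about the display format rather than something the selectors guarantee; the function names and the 6% tolerance are illustrative.

```python
from typing import Optional

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_display_count(text: str) -> Optional[int]:
    """Convert a rendered count such as '45.3K' or '1,204' to an integer, or None."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        return None
    multiplier = 1
    if cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return round(float(cleaned) * multiplier)
    except (ValueError, OverflowError):
        return None


def counts_agree(display_text: str, exact: int, tolerance: float = 0.06) -> bool:
    """True when the rendered count is within `tolerance` of the exact JSON value."""
    shown = parse_display_count(display_text)
    if shown is None:
        return False
    if exact == 0:
        return shown == 0
    return abs(shown - exact) / exact <= tolerance
```

The tolerance exists because an abbreviated count loses precision: `45.3K` could stand for any value near 45,300.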

5. ByteDance bot detection: what you are up against

TikTok does not use Cloudflare or a third-party WAF for its primary bot defense — ByteDance runs its own detection stack. It operates at multiple layers: IP reputation scoring at the edge, browser fingerprint validation in JavaScript, behavioral analysis of request timing, and content gating based on bot confidence score.

The most common symptom is an HTML response that looks correct structurally — the page loads, the layout renders — but all engagement counts are zero or missing and the embedded JSON is empty. This is a deliberate degraded response, not a network error. A 200 status code does not mean you got real data.

Datacenter IPs are blocked at the IP reputation layer before JavaScript even runs. Residential proxies are mandatory. Region matters: a US residential proxy will not unblock a video that is geo-restricted to Southeast Asia — match the proxy region to the content region.

Empty or missing SIGI_STATE / __UNIVERSAL_DATA_FOR_REHYDRATION__ for bot traffic
Datacenter and hosting ASN blocks at the edge
Geo-restricted content returning 'This video is not available in your region'
CAPTCHA challenges on search, discover, and high-frequency hashtag pagination
Frequent JSON schema changes breaking field paths (plan for schema drift)
Session-based rate limits that tighten after repeated requests from the same IP
JavaScript fingerprinting that detects headless browsers without stealth patches

6. Scraping a TikTok video page

Use `js_rendering` mode with a US residential proxy. Set `js_wait_selector` to a `data-e2e` count attribute so the request waits until TikTok's JavaScript has populated the DOM before returning HTML. A 15-second timeout is sufficient for most video pages; slow connections or heavy videos may need more.

The `css_extractor` output format returns structured fields directly in `body.data.css_extracted`, saving you a parsing step for the visible DOM values. If you also need the raw HTML to extract embedded JSON, switch `output_format` to `html` and parse `body.data.content` in your worker.

After receiving the response, check `body.data.css_extracted.likes` — if the value is an empty string, the page rendered without counts and you should retry with a different proxy endpoint.

TikTok video page — js_rendering with CSS extraction

```json
{
  "url": "https://www.tiktok.com/@nike/video/7234567890123456789",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-e2e=\"like-count\"]",
  "js_wait_timeout": 15000,
  "css_selectors": {
    "description": "[data-e2e=\"browse-video-desc\"]",
    "likes": "strong[data-e2e=\"like-count\"]",
    "comments": "strong[data-e2e=\"comment-count\"]",
    "shares": "strong[data-e2e=\"share-count\"]",
    "author": "span[data-e2e=\"browse-username\"]",
    "music": "[data-e2e=\"browse-music\"]",
    "video_id": "meta[property=\"og:url\"]"
  }
}
```
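
The empty-`likes` check and retry described in this section could be wired up roughly as follows. This is a sketch under assumptions: `send` is a hypothetical callable that posts the request body and returns the decoded JSON response, and the response is assumed to expose the `body.data.css_extracted` path as a nested dictionary.

```python
from typing import Callable, Optional, Sequence


def counts_present(response: dict, field: str = "likes") -> bool:
    """True when body.data.css_extracted[field] is a non-empty string."""
    body = response.get("body") or {}
    data = body.get("data") or {}
    extracted = data.get("css_extracted") or {}
    value = extracted.get(field)
    return isinstance(value, str) and value.strip() != ""


def fetch_with_proxy_rotation(
    send: Callable[[dict], dict],
    request: dict,
    proxies: Sequence[str],
) -> Optional[dict]:
    """Try each proxy in turn; return the first response that carries counts."""
    for proxy in proxies:
        attempt = {**request, "proxy": proxy}  # same request, different proxy
        response = send(attempt)
        if counts_present(response):
            return response
    return None  # every attempt rendered without counts
```

Passing `send` in as a parameter keeps the retry logic independent of any particular HTTP client and makes it easy to exercise with a stub.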

7. Scraping a TikTok hashtag challenge page

Hashtag pages aggregate total views across all videos using that tag and list top-performing videos in a grid. They are among the most heavily bot-scrutinized endpoints on TikTok — keep request frequency low and distribute across proxy sessions.

Wait on `[data-e2e="challenge-title"]` to confirm the page has hydrated. The `video_links` selector captures all `href` values from anchor tags pointing to video URLs, giving you a list of video IDs to enqueue for individual video requests.

Hashtag pages paginate client-side via infinite scroll. To retrieve more than the initial grid (typically 12–20 videos), you would need scroll simulation, which requires a stateful `session_id` across multiple requests — a more complex pipeline than single-page extraction.

TikTok hashtag challenge page — js_rendering with CSS extraction

```json
{
  "url": "https://www.tiktok.com/tag/fyp",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-e2e=\"challenge-title\"]",
  "js_wait_timeout": 15000,
  "css_selectors": {
    "hashtag": "h1[data-e2e=\"challenge-title\"]",
    "views": "strong[data-e2e=\"challenge-view-count\"]",
    "video_count": "[data-e2e=\"challenge-video-count\"]",
    "video_links": "a[href*=\"/video/\"]",
    "top_creator_links": "a[href*=\"/@\"]"
  }
}
```
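
To turn the captured links into a work queue, the hrefs need to be reduced to unique video IDs. The sketch below assumes `video_links` arrives as a list of href strings (absolute or relative); that shape, and the function name, are assumptions for illustration.

```python
import re

# Matches the canonical /@username/video/VIDEO_ID path inside any href.
_VIDEO_PATH = re.compile(r"/@(?P<user>[^/?#]+)/video/(?P<id>\d+)")


def video_ids_from_links(hrefs):
    """Return unique (username, video_id) pairs in first-seen order."""
    seen = set()
    results = []
    for href in hrefs:
        match = _VIDEO_PATH.search(href)
        if not match:
            continue  # not a video link (profile, tag, and so on)
        video_id = match.group("id")
        if video_id in seen:
            continue  # grid cards often link to the same video more than once
        seen.add(video_id)
        results.append((match.group("user"), video_id))
    return results
```

Keeping first-seen order preserves the grid ranking, which may matter if the top of the grid reflects the best-performing videos.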

8. Extracting embedded JSON from the HTML response

When you need the full structured payload — precise integer counts, Unix timestamps, sound metadata, region codes — request `output_format: html` and parse `body.data.content` in your processing pipeline. The embedded JSON contains far more fields than visible DOM selectors expose.

Target the script tag by ID. In Python with BeautifulSoup: `soup.find('script', {'id': '__UNIVERSAL_DATA_FOR_REHYDRATION__'})`. Extract `.string`, then `json.loads()`. The top-level key path for video detail pages is typically `__DEFAULT_SCOPE__['webapp.video-detail']['itemInfo']['itemStruct']`. Older pages may still use `ItemModule` at the top level — write your parser to try both paths.
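
A parser that tries both paths might look like the sketch below. It operates on the already-decoded payload (a Python dictionary); the function name is hypothetical, and falling back to the first `ItemModule` entry when no video ID is supplied is a design assumption, not documented behavior.

```python
from typing import Optional

NEW_PATH = ("__DEFAULT_SCOPE__", "webapp.video-detail", "itemInfo", "itemStruct")


def find_item_struct(payload, video_id: Optional[str] = None) -> Optional[dict]:
    """Return the video object from either schema, or None if neither path resolves."""
    if not isinstance(payload, dict):
        return None

    # Newer schema: walk the nested path one key at a time.
    node = payload
    for key in NEW_PATH:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
    if isinstance(node, dict) and node:
        return node

    # Older schema: top-level ItemModule keyed by video ID.
    item_module = payload.get("ItemModule")
    if isinstance(item_module, dict) and item_module:
        if video_id is not None:
            return item_module.get(video_id)
        return next(iter(item_module.values()))  # no ID given: take the first entry
    return None
```

A `None` result is worth logging together with the raw payload, since it means either a degraded response or a field path that has moved.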

Key fields inside `itemStruct`: `id` (video ID string), `desc` (description with hashtags inline), `createTime` (Unix timestamp), `stats` (object with `playCount`, `diggCount`, `commentCount`, `shareCount`, `collectCount`), `music` (object with `id`, `title`, `authorName`, `original`), `author` (object with `uniqueId`, `nickname`, `followerCount`, `verified`), `video` (object with `duration`, `cover`, `dynamicCover`).

Schema drift is real — ByteDance ships frontend changes without versioning the JSON structure. Build your parser defensively: use `.get()` with defaults in Python, optional chaining in TypeScript, and log raw payloads when expected fields are missing so you can update field paths quickly. See web scraping with Python for general parsing patterns.

9. TikTok Terms of Service and legal considerations

TikTok's Terms of Service prohibit unauthorized automated data collection. ByteDance actively enforces this technically (bot detection) and legally (cease-and-desist letters to commercial scrapers). Before building a TikTok data pipeline, assess your use case against the terms and applicable law.

Academic researchers should evaluate TikTok's Research API, which provides structured access to public data for qualified institutions without the legal exposure of scraping. Commercial use cases — competitive intelligence, trend monitoring for brands, influencer analytics — require either official data licensing through ByteDance partnerships or licensed data from authorized providers.

Do not scrape content from accounts belonging to minors for commercial profiling. Regional regulations add additional constraints: GDPR in the EU governs processing of personal data visible in public posts; US state privacy laws (CCPA and successors) apply to California residents' data. Content that is public on TikTok is not automatically free of copyright — video content and music are separately protected.

Metadata scraping (counts, timestamps, hashtags) sits in a different legal category than downloading video files. Video CDN URLs in TikTok's API responses carry short-lived cryptographic signatures and are subject to copyright. Scrape metadata; do not download and redistribute video content without explicit rights.

Frequently asked questions

Where did SIGI_STATE go on TikTok?

ByteDance migrated the embedded JSON payload from a script tag with id='SIGI_STATE' to one with id='__UNIVERSAL_DATA_FOR_REHYDRATION__'. The internal structure also changed — the video detail path is now nested under __DEFAULT_SCOPE__['webapp.video-detail']['itemInfo']['itemStruct'] rather than at the top-level ItemModule key. Additionally, TikTok omits or empties this payload for requests that score as bot traffic. The fix is js_rendering mode with a residential proxy — the headless browser executes TikTok's JavaScript, which populates the script tag after hydration.

Why are play counts and like counts missing from the response?

Empty counts almost always mean TikTok served a degraded bot response — the page structure is present but counts are withheld. Three things to check: (1) Are you using a datacenter IP? Switch to residential. (2) Is the video geo-restricted? Try a proxy in the creator's region. (3) Are you using fast or auto mode without js_rendering? TikTok counts require JavaScript execution to populate. Switch to js_rendering with js_wait_selector pointing to a data-e2e count attribute.

Can I scrape TikTok video download URLs?

TikTok video CDN URLs are embedded in the page JSON and carry expiring cryptographic signatures — they stop working within minutes to hours. Downloading TikTok videos and redistributing them violates both TikTok's Terms of Service and copyright law (the video creator holds rights to the content; the music is separately licensed). Scrape metadata only unless you have explicit rights from both the creator and the music rights holder.

How do I scrape more than the initial 12–20 videos from a hashtag page?

Hashtag pages paginate via infinite scroll, which is driven by client-side JavaScript. To retrieve additional pages you need to simulate scrolling within a persistent browser session. Use session_id in your OmniScrape requests to maintain session state across multiple calls, and inject scroll events between requests. This is significantly more complex than single-page extraction and increases detection risk — keep inter-request delays realistic (several seconds minimum).

What proxy region should I use for TikTok?

Match the proxy region to the content region. For US-targeted content and global trending hashtags, residential:us is appropriate. For region-specific content — Southeast Asian creators, EU-only sounds — use a proxy in that region. TikTok's geo-restriction is applied at the edge based on the requesting IP's location, not based on any header you can spoof.

Is there an official TikTok API I should use instead?

TikTok offers two official APIs: the TikTok Research API (for qualified academic researchers, application required, covers public data) and the TikTok for Developers platform (for building apps that post or read data on behalf of authenticated users, not suitable for bulk public data collection). Commercial trend monitoring and influencer analytics products are not covered by either — those use cases require a data licensing agreement with ByteDance or a licensed third-party data provider.

How often does TikTok's JSON schema change?

ByteDance ships frontend changes continuously without versioning the embedded JSON structure. Field paths shift several times per year. Build your parser defensively: use optional chaining or .get() with fallback defaults, log the raw payload whenever expected fields are absent, and set up alerting when your extracted counts drop to zero across multiple videos simultaneously (a reliable signal that a schema change has broken your field paths rather than that TikTok is blocking you).
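
The alerting rule can be approximated with a batch-level check. In the sketch below, the record shape (dictionaries with `plays` and `likes` keys), the 80% threshold, and the minimum batch size of five are all illustrative assumptions to tune against real traffic.

```python
def zero_count_fraction(records, fields=("plays", "likes")) -> float:
    """Fraction of records whose listed count fields are all zero or missing."""
    if not records:
        return 0.0
    zeroed = sum(
        1 for record in records
        if all(not record.get(field) for field in fields)
    )
    return zeroed / len(records)


def looks_like_schema_break(records, threshold: float = 0.8, min_batch: int = 5) -> bool:
    """True when most of a sufficiently large batch came back with no counts."""
    if len(records) < min_batch:
        return False  # too few videos to separate a pattern from noise
    return zero_count_fraction(records) >= threshold
```

Requiring a minimum batch size avoids paging someone because one or two videos happened to come back empty.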

