1. TikTok data fields worth extracting
The most valuable signals are engagement ratios (likes-to-plays, comments-to-plays) and velocity — how fast counts are climbing relative to video age. Raw counts without timestamps are far less useful for trend detection.
Below is the full set of fields available from TikTok web pages, either from embedded JSON or rendered DOM. Not all fields are present on every page type — hashtag pages omit per-video durations, for example.
- Video ID, canonical URL, and short-link redirect target
- Video description text, inline hashtags, and @mentions
- Play count, like count, comment count, share count, and collect count
- Video create timestamp and duration in seconds
- Music/sound ID, sound title, and original author flag
- Creator username, display name, follower count, and verified status
- Hashtag challenge aggregate view count and video count
- Effect and sticker metadata when attached to a video
- Region code and content language signals from video metadata
- Cover image URL and dynamic cover (animated thumbnail) URL
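The engagement ratios and velocity mentioned at the start of this section can be derived from the count and timestamp fields above. The sketch below is one possible way to do it, not a prescribed method. It assumes the count names used in the embedded JSON (`playCount`, `diggCount`, `commentCount`) and a creation timestamp in Unix seconds; the function name `engagement_signals` is hypothetical.

```python
import time
from typing import Optional


def engagement_signals(stats: dict, create_time: int, now: Optional[float] = None) -> dict:
    """Compute engagement ratios and average play velocity from raw counts.

    Assumptions: `stats` uses the embedded-JSON count names and
    `create_time` is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    plays = stats.get("playCount", 0)
    # Clamp the age so a just-posted video never divides by zero.
    age_hours = max((now - create_time) / 3600.0, 1e-9)

    def ratio(count: int) -> Optional[float]:
        # A ratio is undefined when there are no plays yet.
        return count / plays if plays else None

    return {
        "likes_to_plays": ratio(stats.get("diggCount", 0)),
        "comments_to_plays": ratio(stats.get("commentCount", 0)),
        "plays_per_hour": plays / age_hours,
    }
```

Note that `plays_per_hour` here is an average since posting. Measuring how fast counts are climbing right now would need two snapshots of the same video taken at different times.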
2. TikTok web URL patterns
TikTok canonicalizes video URLs to the `/@username/video/VIDEO_ID` format. Short links (`vm.tiktok.com`) redirect to the canonical form — OmniScrape follows redirects automatically, so you can pass either form. The final URL after redirect is returned in the response metadata.
Hashtag pages and sound pages are the two other high-value targets. Search pages (`/search?q=`) render results but paginate client-side, making them harder to scrape at volume without session continuity.
- Video (canonical): https://www.tiktok.com/@charlidamelio/video/7234567890123456789
- Short link (auto-followed): https://vm.tiktok.com/ZMhABC123/
- Hashtag challenge: https://www.tiktok.com/tag/fyp
- Sound page: https://www.tiktok.com/music/original-sound-7234567890123456789
- Creator profile: https://www.tiktok.com/@nike
- Search results: https://www.tiktok.com/search?q=scraping
- Discover (trending): https://www.tiktok.com/explore
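When a pipeline receives mixed URLs, it helps to classify them before choosing selectors. The following is a minimal sketch based only on the patterns listed above; the function name, the returned labels, and the username character set are illustrative assumptions rather than anything TikTok documents.

```python
import re
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse

# Path patterns taken from the URL list above; order matters because
# a video path also starts with /@username.
_PATH_PATTERNS = [
    ("video", re.compile(r"^/@([\w.-]+)/video/(\d+)/?$")),
    ("hashtag", re.compile(r"^/tag/([^/]+)/?$")),
    ("sound", re.compile(r"^/music/([^/]+)/?$")),
    ("profile", re.compile(r"^/@([\w.-]+)/?$")),
]


def classify_tiktok_url(url: str) -> Tuple[str, Optional[str]]:
    """Return (page_kind, identifier) for a TikTok web URL."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == "vm.tiktok.com":
        # Short links must be resolved by following the redirect.
        return ("short_link", parsed.path.strip("/") or None)
    if host not in ("www.tiktok.com", "tiktok.com"):
        return ("unknown", None)
    path = parsed.path.rstrip("/")
    if path == "/search":
        return ("search", parse_qs(parsed.query).get("q", [None])[0])
    if path == "/explore":
        return ("explore", None)
    for kind, pattern in _PATH_PATTERNS:
        match = pattern.match(parsed.path)
        if match:
            # The last captured group is the most specific identifier
            # (the video ID for video URLs, the name otherwise).
            return (kind, match.group(match.lastindex))
    return ("unknown", None)
```

For example, `classify_tiktok_url("https://www.tiktok.com/tag/fyp")` returns `("hashtag", "fyp")`.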
3. SIGI_STATE and __UNIVERSAL_DATA_FOR_REHYDRATION__
TikTok's web app is a React SPA that bootstraps from a server-rendered JSON payload embedded in a `<script>` tag. Older builds used `<script id="SIGI_STATE" type="application/json">`. Current builds use `<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">`. The payload structure differs: the older payload exposes an `ItemModule` map keyed by video ID, while the newer one nests a single `itemStruct` under `__DEFAULT_SCOPE__`. In both, the video record carries nested `stats`, `author`, `music`, and `video` objects.
When this JSON is present, it is far more reliable than CSS extraction: counts are exact integers, timestamps are Unix epoch, and music metadata is structured. Extract the script tag content with a regex or an HTML parser, then `JSON.parse()` the inner text. Navigate to `__DEFAULT_SCOPE__['webapp.video-detail'].itemInfo.itemStruct` (newer schema) or `ItemModule[videoId]` (older schema).
When the script tag exists but its content is an empty object (`{}`), or the tag is absent entirely, the most likely cause is that ByteDance's bot scoring has flagged the request. The usual fix is to escalate to `js_rendering` mode with a residential proxy in the target region. The headless browser executes TikTok's JavaScript, which populates the script tag after client-side hydration.
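The extraction and the empty-payload check can be sketched in Python with the standard library only. This is an illustrative regex-based approach under the assumption that the `id` attribute is double-quoted as in the tags shown above; `extract_embedded_json` is a hypothetical helper name, and an HTML parser would be a more robust choice for messy markup.

```python
import json
import re
from typing import Optional

# Newer tag first, then the older one.
SCRIPT_IDS = ("__UNIVERSAL_DATA_FOR_REHYDRATION__", "SIGI_STATE")


def extract_embedded_json(html: str) -> Optional[dict]:
    """Return the first non-empty embedded payload, or None.

    None covers three cases: the tag is absent, its content is not
    valid JSON, or the payload is an empty object.
    """
    for script_id in SCRIPT_IDS:
        pattern = re.compile(
            r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
            re.DOTALL,
        )
        match = pattern.search(html)
        if not match:
            continue
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and payload:
            return payload
    return None
```

A `None` result is the signal to escalate the request rather than to keep parsing.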
4. Visible DOM selectors as fallback
When embedded JSON is unavailable or you want to validate counts against rendered output, TikTok's `data-e2e` attributes are the most stable CSS hooks. The platform uses these internally for end-to-end test automation, so they change less frequently than class names or element hierarchy.
Key selectors on a video detail page: `strong[data-e2e="like-count"]`, `strong[data-e2e="comment-count"]`, `strong[data-e2e="share-count"]`, `strong[data-e2e="undefined-count"]` (collect/bookmark count), `div[data-e2e="browse-video-desc"]` for the description, and `span[data-e2e="browse-username"]` for the creator handle.
On hashtag pages, `h1[data-e2e="challenge-title"]` holds the hashtag name and `strong[data-e2e="challenge-view-count"]` holds the aggregate view count. Individual video cards in the grid use `div[data-e2e="challenge-item"]` as container elements.
These selectors break when TikTok ships a redesign — treat them as likely-stable rather than guaranteed. Cross-check against the embedded JSON whenever possible.
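Cross-checking needs a normalization step, because the embedded JSON holds exact integers while rendered counts are usually abbreviated. The sketch below assumes the rendered text uses `K`, `M`, and `B` suffixes (for example `1.2M`) and treats a 5% difference as agreement; both are assumptions to adjust against real pages, and the function names are hypothetical.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed suffixes for abbreviated counts in the rendered DOM.
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_display_count(text: str) -> Optional[int]:
    """Convert rendered text such as '1.2M' or '987' to an integer."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        return None
    multiplier = 1
    if cleaned[-1] in _SUFFIXES:
        multiplier = _SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        # Decimal avoids float rounding on values like 1.2 * 1_000_000.
        return int(Decimal(cleaned) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None


def counts_agree(display_text: str, exact: int, tolerance: float = 0.05) -> bool:
    """True when the rendered count is within `tolerance` of the exact count."""
    shown = parse_display_count(display_text)
    if shown is None:
        return False
    if exact == 0:
        return shown == 0
    return abs(shown - exact) / exact <= tolerance
```

A mismatch beyond the tolerance suggests either a stale selector or a field path that no longer points at the right count.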
5. ByteDance bot detection: what you are up against
TikTok does not use Cloudflare or a third-party WAF for its primary bot defense — ByteDance runs its own detection stack. It operates at multiple layers: IP reputation scoring at the edge, browser fingerprint validation in JavaScript, behavioral analysis of request timing, and content gating based on bot confidence score.
The most common symptom is an HTML response that looks correct structurally — the page loads, the layout renders — but all engagement counts are zero or missing and the embedded JSON is empty. This is a deliberate degraded response, not a network error. A 200 status code does not mean you got real data.
Datacenter IPs are blocked at the IP reputation layer before JavaScript even runs. Residential proxies are mandatory. Region matters: a US residential proxy will not unblock a video that is geo-restricted to Southeast Asia — match the proxy region to the content region.
- Empty or missing SIGI_STATE / __UNIVERSAL_DATA_FOR_REHYDRATION__ for bot traffic
- Datacenter and hosting ASN blocks at the edge
- Geo-restricted content returning 'This video is not available in your region'
- CAPTCHA challenges on search, discover, and high-frequency hashtag pagination
- Frequent JSON schema changes breaking field paths (plan for schema drift)
- Session-based rate limits that tighten after repeated requests from the same IP
- JavaScript fingerprinting that detects headless browsers without stealth patches
6. Scraping a TikTok video page
Use `js_rendering` mode with a US residential proxy. Set `js_wait_selector` to a `data-e2e` count attribute so the request waits until TikTok's JavaScript has populated the DOM before returning HTML. A 15-second timeout is sufficient for most video pages; slow connections or heavy videos may need more.
The `css_extractor` output format returns structured fields directly in `body.data.css_extracted`, saving you a parsing step for the visible DOM values. If you also need the raw HTML to extract embedded JSON, switch `output_format` to `html` and parse `body.data.content` in your worker.
After receiving the response, check `body.data.css_extracted.likes` — if the value is an empty string, the page rendered without counts and you should retry with a different proxy endpoint.
{
  "url": "https://www.tiktok.com/@nike/video/7234567890123456789",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-e2e=\"like-count\"]",
  "js_wait_timeout": 15000,
  "css_selectors": {
    "description": "[data-e2e=\"browse-video-desc\"]",
    "likes": "strong[data-e2e=\"like-count\"]",
    "comments": "strong[data-e2e=\"comment-count\"]",
    "shares": "strong[data-e2e=\"share-count\"]",
    "author": "span[data-e2e=\"browse-username\"]",
    "music": "[data-e2e=\"browse-music\"]",
    "video_id": "meta[property=\"og:url\"]"
  }
}
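The empty-value check on `body.data.css_extracted.likes` could look like the sketch below. It assumes `body` is the parsed JSON response and that `css_extracted` maps each selector name from the request to the extracted value; `should_retry` and `required_fields` are hypothetical names, not part of the OmniScrape API.

```python
from typing import Any, Iterable


def _is_blank(value: Any) -> bool:
    """Treat None, whitespace-only strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def should_retry(body: dict, required_fields: Iterable[str] = ("likes",)) -> bool:
    """Return True when any required extracted field came back blank.

    Assumed response shape: body["data"]["css_extracted"][<selector name>].
    """
    data = body.get("data") if isinstance(body, dict) else None
    extracted = data.get("css_extracted") if isinstance(data, dict) else None
    if not isinstance(extracted, dict):
        return True
    return any(_is_blank(extracted.get(field)) for field in required_fields)
```

When `should_retry` returns `True`, resend the same request through a different proxy endpoint rather than treating the 200 response as valid data.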
7. Scraping a TikTok hashtag challenge page
Hashtag pages aggregate total views across all videos using that tag and list top-performing videos in a grid. They are among the most heavily bot-scrutinized endpoints on TikTok — keep request frequency low and distribute across proxy sessions.
Wait on `[data-e2e="challenge-title"]` to confirm the page has hydrated. The `video_links` selector captures all `href` values from anchor tags pointing to video URLs; parse the video ID out of each URL to build the list of videos to enqueue for individual video requests.
Hashtag pages paginate client-side via infinite scroll. To retrieve more than the initial grid (typically 12–20 videos), you would need scroll simulation, which requires a stateful `session_id` across multiple requests — a more complex pipeline than single-page extraction.
{
  "url": "https://www.tiktok.com/tag/fyp",
  "mode": "js_rendering",
  "output_format": "css_extractor",
  "proxy": "residential:us",
  "js_wait_selector": "[data-e2e=\"challenge-title\"]",
  "js_wait_timeout": 15000,
  "css_selectors": {
    "hashtag": "h1[data-e2e=\"challenge-title\"]",
    "views": "strong[data-e2e=\"challenge-view-count\"]",
    "video_count": "[data-e2e=\"challenge-video-count\"]",
    "video_links": "a[href*=\"/video/\"]",
    "top_creator_links": "a[href*=\"/@\"]"
  }
}
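Turning the `video_links` output into a queue of video requests might look like the following. This is a rough sketch that assumes the extractor returns the `href` values as a list of strings, possibly relative and possibly with query parameters; the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Matches the canonical /@username/video/VIDEO_ID path inside any href.
_VIDEO_HREF = re.compile(r"/@([\w.-]+)/video/(\d+)")


def video_urls_from_hrefs(hrefs: Iterable[str]) -> List[str]:
    """Return deduplicated canonical video URLs, preserving first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        match = _VIDEO_HREF.search(href)
        if not match:
            continue  # profile links and other anchors are skipped
        username, video_id = match.groups()
        if video_id in seen:
            continue
        seen.add(video_id)
        urls.append(f"https://www.tiktok.com/@{username}/video/{video_id}")
    return urls
```

Deduplicating by video ID matters here because grid cards often contain more than one anchor pointing at the same video.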
8. Extracting embedded JSON from the HTML response
When you need the full structured payload — precise integer counts, Unix timestamps, sound metadata, region codes — request `output_format: html` and parse `body.data.content` in your processing pipeline. The embedded JSON contains far more fields than visible DOM selectors expose.
Target the script tag by ID. In Python with BeautifulSoup: `soup.find('script', {'id': '__UNIVERSAL_DATA_FOR_REHYDRATION__'})`. Extract `.string`, then `json.loads()`. The top-level key path for video detail pages is typically `__DEFAULT_SCOPE__['webapp.video-detail']['itemInfo']['itemStruct']`. Older pages may still use `ItemModule` at the top level — write your parser to try both paths.
Key fields inside `itemStruct`: `id` (video ID string), `desc` (description with hashtags inline), `createTime` (Unix timestamp), `stats` (object with `playCount`, `diggCount`, `commentCount`, `shareCount`, `collectCount`), `music` (object with `id`, `title`, `authorName`, `original`), `author` (object with `uniqueId`, `nickname`, `followerCount`, `verified`), `video` (object with `duration`, `cover`, `dynamicCover`).
Schema drift is real — ByteDance ships frontend changes without versioning the JSON structure. Build your parser defensively: use `.get()` with defaults in Python, optional chaining in TypeScript, and log raw payloads when expected fields are missing so you can update field paths quickly. See web scraping with Python for general parsing patterns.
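A defensive parser along these lines might be sketched as follows. It takes an already decoded payload dict, tries the newer path first, then falls back to `ItemModule`. The helper names (`dig`, `find_item_struct`, `flatten_item`), the output key names, and the logger name are illustrative assumptions; only the field paths come from the description above.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("tiktok_parser")  # hypothetical logger name


def dig(obj: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts, returning `default` as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def find_item_struct(payload: dict, video_id: Optional[str] = None) -> Optional[dict]:
    """Locate the video record in either the newer or the older schema."""
    item = dig(payload, "__DEFAULT_SCOPE__", "webapp.video-detail", "itemInfo", "itemStruct")
    if isinstance(item, dict) and item:
        return item
    items = dig(payload, "ItemModule", default={})
    if not isinstance(items, dict):
        return None
    if video_id is not None:
        found = items.get(video_id)
        return found if isinstance(found, dict) else None
    if len(items) == 1:
        only = next(iter(items.values()))
        return only if isinstance(only, dict) else None
    return None


def flatten_item(item: dict) -> dict:
    """Pull the key fields into a flat record, using None for anything absent."""
    record = {
        "id": item.get("id"),
        "desc": item.get("desc", ""),
        "create_time": item.get("createTime"),
        "plays": dig(item, "stats", "playCount"),
        "likes": dig(item, "stats", "diggCount"),
        "comments": dig(item, "stats", "commentCount"),
        "shares": dig(item, "stats", "shareCount"),
        "collects": dig(item, "stats", "collectCount"),
        "author": dig(item, "author", "uniqueId"),
        "music_id": dig(item, "music", "id"),
        "duration": dig(item, "video", "duration"),
    }
    missing = sorted(key for key, value in record.items() if value is None)
    if missing:
        # Only the top-level keys are logged here to keep the line short;
        # in practice, store the full raw payload alongside this warning.
        logger.warning("missing fields %s; raw item keys: %s", missing, sorted(item))
    return record
```

Returning `None` for absent fields, rather than `0`, keeps a missing count distinguishable from a genuine zero when you later look for schema drift.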
9. TikTok Terms of Service and legal considerations
TikTok's Terms of Service prohibit unauthorized automated data collection. ByteDance actively enforces this technically (bot detection) and legally (cease-and-desist letters to commercial scrapers). Before building a TikTok data pipeline, assess your use case against the terms and applicable law.
Academic researchers should evaluate TikTok's Research API, which provides structured access to public data for qualified institutions without the legal exposure of scraping. Commercial use cases — competitive intelligence, trend monitoring for brands, influencer analytics — require either official data licensing through ByteDance partnerships or licensed data from authorized providers.
Do not scrape content from accounts belonging to minors for commercial profiling. Regional regulations add further constraints: GDPR in the EU governs processing of personal data visible in public posts, and US state privacy laws, such as California's CCPA and its successors, apply to the data of residents of those states. Content that is public on TikTok is not automatically free of copyright: video content and music are separately protected.
Metadata scraping (counts, timestamps, hashtags) sits in a different legal category than downloading video files. Video CDN URLs in TikTok's API responses carry short-lived cryptographic signatures and are subject to copyright. Scrape metadata; do not download and redistribute video content without explicit rights.
Frequently asked questions
Where did SIGI_STATE go on TikTok?
ByteDance migrated the embedded JSON payload from a script tag with id='SIGI_STATE' to one with id='__UNIVERSAL_DATA_FOR_REHYDRATION__'. The internal structure also changed — the video detail path is now nested under __DEFAULT_SCOPE__['webapp.video-detail']['itemInfo']['itemStruct'] rather than at the top-level ItemModule key. Additionally, TikTok omits or empties this payload for requests that score as bot traffic. The fix is js_rendering mode with a residential proxy — the headless browser executes TikTok's JavaScript, which populates the script tag after hydration.
Why are play counts and like counts missing from the response?
Empty counts almost always mean TikTok served a degraded bot response — the page structure is present but counts are withheld. Three things to check: (1) Are you using a datacenter IP? Switch to residential. (2) Is the video geo-restricted? Try a proxy in the creator's region. (3) Are you using fast or auto mode without js_rendering? TikTok counts require JavaScript execution to populate. Switch to js_rendering with js_wait_selector pointing to a data-e2e count attribute.
Can I scrape TikTok video download URLs?
TikTok video CDN URLs are embedded in the page JSON and carry expiring cryptographic signatures — they stop working within minutes to hours. Downloading TikTok videos and redistributing them violates both TikTok's Terms of Service and copyright law (the video creator holds rights to the content; the music is separately licensed). Scrape metadata only unless you have explicit rights from both the creator and the music rights holder.
How do I scrape more than the initial 12–20 videos from a hashtag page?
Hashtag pages paginate via infinite scroll, which is driven by client-side JavaScript. To retrieve additional pages you need to simulate scrolling within a persistent browser session. Use session_id in your OmniScrape requests to maintain session state across multiple calls, and inject scroll events between requests. This is significantly more complex than single-page extraction and increases detection risk — keep inter-request delays realistic (several seconds minimum).
What proxy region should I use for TikTok?
Match the proxy region to the content region. For US-targeted content and global trending hashtags, residential:us is appropriate. For region-specific content — Southeast Asian creators, EU-only sounds — use a proxy in that region. TikTok's geo-restriction is applied at the edge based on the requesting IP's location, not based on any header you can spoof.
Is there an official TikTok API I should use instead?
TikTok offers two official APIs: the TikTok Research API (for qualified academic researchers, application required, covers public data) and the TikTok for Developers platform (for building apps that post or read data on behalf of authenticated users, not suitable for bulk public data collection). Commercial trend monitoring and influencer analytics products are not covered by either — those use cases require a data licensing agreement with ByteDance or a licensed third-party data provider.
How often does TikTok's JSON schema change?
ByteDance ships frontend changes continuously without versioning the embedded JSON structure. Field paths shift several times per year. Build your parser defensively: use optional chaining or .get() with fallback defaults, log the raw payload whenever expected fields are absent, and set up alerting when your extracted counts drop to zero across multiple videos simultaneously. If the logged payloads are still populated, a schema change has most likely broken your field paths; if they are empty, TikTok is more likely serving degraded bot responses.