Lanshanbot

Bot User-Agent: lanshanbot

🤖 Overview

Lanshanbot is a web crawler operated by ByteDance Inc., the parent company of TikTok, Douyin, and Toutiao. It was first documented in official ByteDance crawler guidelines published in 2022. Its primary purpose is to collect publicly accessible web content for indexing in ByteDance’s search products, including Toutiao search and Douyin search. The collected data also feeds product recommendation systems and language model training pipelines associated with Volcano Engine, ByteDance’s cloud and AI services platform. In short, the bot supports the company’s core products by discovering fresh content at scale.

🌐 Technical Behavior

Lanshanbot issues HTTP/1.1 and HTTP/2 requests with a configurable crawl delay, typically between 1 and 10 seconds, though observed bursts can approach 50 requests per second from certain IP pools. Its crawls originate from IP ranges listed in ByteDance’s official IP whitelist, documented on the company’s ByteDance Crawler IP Ranges page; as of 2024 the list includes blocks such as 45.254.132.0/24 and 103.234.152.0/23. The bot respects the Crawl-delay directive in robots.txt and pauses between requests accordingly, but it does not honor Request-rate, which is likewise a non-standard robots.txt directive rather than an HTTP header. It fetches pages with both plain GET and conditional GET (If-Modified-Since, If-None-Match) to reduce bandwidth consumption. According to reverse-engineering analyses published on GitHub (e.g., lanshanbot-ua-list), the crawler follows standard link traversal, parsing sitemaps and following redirects, but skips URLs that contain # fragments or use the javascript: scheme.
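
As a rough illustration of how the cited IP blocks could be used, the Python sketch below checks whether a client address falls inside them. The function name in_cited_ranges is hypothetical, and the two CIDR blocks are only the examples quoted above, not a complete or current list; the published ranges should be consulted before relying on a check like this.

```python
import ipaddress

# CIDR blocks quoted in the text above. They are examples only and may be
# incomplete or outdated; load the operator's current list in real use.
CITED_RANGES = [
    ipaddress.ip_network("45.254.132.0/24"),
    ipaddress.ip_network("103.234.152.0/23"),
]


def in_cited_ranges(ip: str) -> bool:
    """Return True if ip parses as an address inside one of CITED_RANGES."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal, so it cannot match any range.
        return False
    return any(addr in net for net in CITED_RANGES)


if __name__ == "__main__":
    for candidate in ("45.254.132.17", "103.234.153.200", "203.0.113.9"):
        print(candidate, in_cited_ranges(candidate))
```

A /23 spans two adjacent /24 blocks, so 103.234.152.0/23 covers 103.234.152.0 through 103.234.153.255.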

📋 robots.txt Compliance

According to official ByteDance documentation, Lanshanbot fully supports the Robots Exclusion Standard, including both Disallow and Allow directives, and respects per-path exclusions. Multiple webmaster forums report that the crawler also honors the Crawl-delay directive, pausing between requests with a granularity of whole seconds. As of early 2025, no security advisory has documented Lanshanbot ignoring robots.txt rules.
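
To make the directives above concrete, here is a minimal sketch that feeds a sample robots.txt to Python's standard urllib.robotparser and queries it for the lanshanbot token. The paths and the 5-second delay are invented for illustration. Python's parser applies the first matching rule in file order, which is why Allow precedes Disallow here; how Lanshanbot itself resolves overlapping rules is not stated in the sources above.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: the paths and the delay value are assumptions.
ROBOTS_TXT = """\
User-agent: lanshanbot
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_rules(ROBOTS_TXT)

if __name__ == "__main__":
    print(parser.can_fetch("lanshanbot", "/private/press/launch.html"))
    print(parser.can_fetch("lanshanbot", "/private/admin"))
    print(parser.crawl_delay("lanshanbot"))
```

Testing a rule set this way before deploying it can catch ordering mistakes, such as a broad Disallow shadowing a narrower Allow.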

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Lanshanbot/1.0; +https://lanshan.bytedance.com/crawler), though variants exist for mobile and desktop emulation. Behavioral fingerprints include a high proportion of requests carrying the Accept-Encoding: gzip, deflate, br and Connection: keep-alive headers, and an absence of the JavaScript execution and WebSocket usage typical of browsers. On some requests the crawler also sends a From header containing an email address (e.g., [email protected]), as specified in official guidelines. Security analysts note that a request’s source IP can be verified against ByteDance’s published ASN (AS136907).
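
A simple first-pass check is to match the bot token in the User-Agent header. The sketch below is one way to do that in Python; the function name match_lanshanbot and the regular expression are illustrative assumptions, not part of any ByteDance specification. Because a User-Agent string is trivially spoofed, a match should be treated as a hint and paired with IP or ASN verification.

```python
import re
from typing import Optional

# Case-insensitive match on the bot token, with an optional "/version" part.
LANSHANBOT_UA = re.compile(r"\blanshanbot(?:/(?P<version>[\d.]+))?", re.IGNORECASE)


def match_lanshanbot(user_agent: Optional[str]) -> Optional[str]:
    """Return the claimed version ("" if absent), or None if the token is missing."""
    match = LANSHANBOT_UA.search(user_agent or "")
    if match is None:
        return None
    return match.group("version") or ""


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; Lanshanbot/1.0; +https://lanshan.bytedance.com/crawler)"
    print(match_lanshanbot(ua))
```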

📊 Data Usage

All data collected by Lanshanbot is used internally by ByteDance for three purposes: search indexing on Toutiao and Douyin; training machine-learning models, particularly natural language understanding and recommendation algorithms; and improving content relevance on the company’s news aggregation and short-video platforms. According to ByteDance’s privacy policy, the crawled content is not sold to third parties but may be used to train generative AI models, including those powering Douyin’s search assistant.

⚙️ Rate Limiting Policy

Because Lanshanbot can exhibit highly variable crawl rates, occasionally exceeding 100 requests per second from a single IP range, rate limiting is recommended to protect server resources from sudden load spikes. A threshold-based block (e.g., 100 requests per 10 seconds from a single IP) is a typical policy; it balances legitimate content discovery against site stability.
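
The threshold described above can be sketched as a per-IP sliding-window counter. The Python class below is a minimal in-memory illustration: the name SlidingWindowLimiter is hypothetical, the defaults mirror the example figure of 100 requests per 10 seconds, and a production deployment would more likely enforce this at the proxy or WAF layer.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 10.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record a request from ip and return False if it exceeds the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        cutoff = now - self.window_seconds
        # Discard timestamps that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    print([limiter.allow("198.51.100.7", now=t) for t in (0.0, 0.1, 0.2, 1.0, 10.5)])
```

Rejected requests are not recorded, so a client that keeps hammering the server regains access as soon as its earlier accepted requests age out of the window.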

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.