webgather

Bot User-Agent: webgather

🤖 Overview

WebGather is a legitimate web crawler operated by Sogou Inc., a Chinese search engine company owned by Tencent, designed to index publicly accessible web content for Sogou Search and Sogou News. It was first deployed around 2005 and has been continuously updated to support modern web standards including HTTP/2 and JavaScript rendering. According to Sogou’s official webmaster documentation at http://www.sogou.com/docs/help/webmasters.htm, the crawler aims to provide comprehensive Chinese-language search results and feeds the company’s AI-driven NLP models.

🌐 Technical Behavior

WebGather opens multiple concurrent connections at a default rate of 10–20 requests per second, and it respects Crawl-Delay directives in robots.txt. Its IP addresses are allocated from ASNs owned by Tencent, primarily within 101.226.0.0/16, 106.120.0.0/16, and 182.118.0.0/16. It crawls breadth-first over both HTTP and HTTPS, and often sends a From header carrying a contact email address. The crawler also honors noindex meta tags. It can render JavaScript to extract dynamic content but limits this to essential pages.
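As a rough illustration of how the address ranges listed above might be used, the sketch below checks whether a client IP falls inside one of them. The function name `in_reported_ranges` is hypothetical, and the CIDR blocks are copied from this page rather than from an authoritative Sogou list, so a match should be treated as a hint and not as proof of identity.

```python
import ipaddress

# CIDR blocks as quoted on this page; assumed, not verified against an official list.
REPORTED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("101.226.0.0/16", "106.120.0.0/16", "182.118.0.0/16")
]


def in_reported_ranges(ip: str) -> bool:
    """Return True if `ip` parses as an address inside one of the reported ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed input (for example an empty or truncated log field).
        return False
    return any(addr in net for net in REPORTED_RANGES)


if __name__ == "__main__":
    for candidate in ("101.226.10.5", "8.8.8.8", "not-an-ip"):
        print(candidate, in_reported_ranges(candidate))
```

Because published ranges drift over time, a check like this is probably best combined with other signals such as the User-Agent string.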

📋 robots.txt Compliance

Sogou explicitly states in its webmaster guidelines that WebGather honors Disallow and Crawl-Delay directives. This is corroborated by numerous webmaster reports and by the robots.txt tester available on Sogou’s portal. The crawler also respects noarchive and nofollow meta tags, as documented in Sogou’s help pages.
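To make the Disallow and Crawl-Delay behavior concrete, here is a minimal sketch that parses a sample robots.txt with Python's standard `urllib.robotparser` and asks what a compliant crawler would be allowed to do. The robots.txt content, the `example.com` URLs, and the choice of `Sogou web spider` as the matching token are illustrative assumptions; this shows how Python interprets the rules, not how Sogou's own parser works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one directory and ask for a 5 second delay.
ROBOTS_TXT = """\
User-agent: Sogou web spider
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# robotparser matches on the product token, so pass the token rather than
# the full "Mozilla/5.0 (compatible; ...)" header value.
AGENT = "Sogou web spider"

private_allowed = parser.can_fetch(AGENT, "https://example.com/private/report.html")
public_allowed = parser.can_fetch(AGENT, "https://example.com/public/index.html")
delay = parser.crawl_delay(AGENT)

print(private_allowed, public_allowed, delay)
```

Crawl-delay is a non-standard extension, so different crawlers may interpret or ignore it; the value here only shows what a site owner could request.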

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; Sogou web spider/4.0; +http://www.sogou.com/docs/help/webmasters.htm). A variant for news aggregation is Sogou News Spider. Older versions may appear as WebGather/1.0 or Sogou/2.0. Behavioral fingerprints include a consistent request interval of 0.5–2 seconds and the inclusion of the From header. Its Accept-Encoding header typically includes gzip and deflate.
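The User-Agent variants listed above can be matched with a single case-insensitive pattern, sketched below. The pattern and the helper name `looks_like_webgather` are assumptions made for illustration; User-Agent headers are trivially spoofed, so a match only indicates that a request claims to be this crawler.

```python
import re

# Covers the variants named on this page: "Sogou web spider", "Sogou News Spider",
# "WebGather/1.0", and "Sogou/2.0". This is an illustrative pattern, not an official one.
SOGOU_UA_PATTERN = re.compile(
    r"sogou (?:web|news) spider|webgather|\bsogou/\d",
    re.IGNORECASE,
)


def looks_like_webgather(user_agent: str) -> bool:
    """Return True if the header value matches one of the listed variants."""
    return bool(SOGOU_UA_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; Sogou web spider/4.0; "
        "+http://www.sogou.com/docs/help/webmasters.htm)",
        "Sogou News Spider",
        "WebGather/1.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for ua in samples:
        print(looks_like_webgather(ua), ua)
```

In practice a check like this would likely be paired with an IP range or reverse DNS check before acting on the result.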

📊 Data Usage

Collected data is used to build Sogou’s search index, improve ranking algorithms, and train AI models for Chinese natural language processing, including the Sogou Voice Assistant and knowledge graph. It also powers real-time news aggregation for Sogou News. Sogou states that it does not intentionally collect personally identifiable information and that it adheres to Chinese data privacy regulations.

⚙️ Rate Limiting Policy

WebGather is rate-limited because its high default crawl rate can degrade web server performance. The recommended blocking threshold is more than 100 requests per minute from a single IP address, per Sogou’s published guidelines. This policy ensures fair resource usage while still allowing legitimate indexing for the search engine.
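One way to apply a threshold of 100 requests per minute per IP is a sliding-window counter, sketched below. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and this is an in-memory illustration under those assumptions rather than a description of how any particular product enforces the limit.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this request would exceed the threshold
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Simulate 101 requests from one IP, one every 0.1 seconds.
    decisions = [limiter.allow("101.226.0.1", now=i * 0.1) for i in range(101)]
    print(decisions.count(True), decisions.count(False))
```

Timestamps are passed in explicitly so the behavior can be tested deterministically; a real deployment would probably use a monotonic clock and shared storage across workers.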

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.