sosospider

Crawler User-Agent: sosospider

🤖 Overview

Sosospider is an autonomous web crawler developed and operated by Soso Technology Co., Ltd., a Chinese internet services company affiliated with Tencent. Originally launched as part of the Soso search engine (soso.com), the crawler now primarily feeds data into Tencent's search and AI-driven content platforms, including the Tencent News aggregation system and the WeChat public content index. According to its official user-agent documentation, Sosospider collects publicly accessible web pages to build a large-scale search index and to power natural language processing datasets for internal AI models.

🌐 Technical Behavior

Sosospider identifies itself with the user-agent string Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm) and typically issues HTTP GET requests from IP ranges registered to Tencent in China (ASN 132591 and 132203). Its crawl frequency is moderate, averaging 1–3 requests per second per domain, but it can spike to 10 or more requests per second when re-indexing high-traffic sites. The bot supports both HTTP/1.1 and HTTP/2 and respects the noindex meta tag as well as X-Robots-Tag headers. It does not execute JavaScript or crawl dynamically loaded content, relying purely on server-rendered HTML. Logs from major web hosts show that Sosospider often follows a breadth-first traversal and waits 5–10 seconds between requests when a Crawl-Delay directive is present.

📋 robots.txt Compliance

Sosospider officially states compliance with the robots.txt standard in its user-agent specification. Evidence from public server logs and security blogs indicates that it reliably halts on Disallow directives and respects Crawl-Delay values. However, some site administrators report occasional brief surges that ignore delays; these are typically attributed to transient configuration issues rather than intentional disregard.
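To see how a compliant crawler would interpret rules aimed at this bot, the sketch below feeds a small robots.txt to Python's standard urllib.robotparser. The file contents, the /private/ path, the example.com URLs, and the 10-second delay are assumptions chosen for illustration, not rules published by Tencent, and the parser models standard robots.txt matching rather than Sosospider's actual implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the path and delay value are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: sosospider
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# robotparser matches on the product token, so pass "Sosospider"
# rather than the full "Mozilla/5.0 (compatible; ...)" string.
UA = "Sosospider"

allowed_public = parser.can_fetch(UA, "https://example.com/articles/1")
allowed_private = parser.can_fetch(UA, "https://example.com/private/report")
delay = parser.crawl_delay(UA)

print(allowed_public, allowed_private, delay)
```

Running a check like this against your own robots.txt before deploying it is a cheap way to confirm that the sosospider group blocks what you intend and nothing more.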

🔍 Detection Indicators

The primary detection method is the exact user-agent string: Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm). Additional behavioral fingerprints include an Accept-Encoding: gzip header and a rarely used From header stating [email protected]. Reverse DNS lookups for Sosospider IPs often resolve to *.soso.com or *.tencent.com.
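The two indicators above can be combined into a simple check: match the user-agent token, then confirm the source IP with a reverse DNS lookup followed by a forward lookup. This is a minimal sketch; the function names are hypothetical, the suffix list comes only from the hostnames mentioned above, and because reverse DNS is described as matching "often" rather than always, a failed lookup is a signal to weigh, not proof of spoofing.

```python
import re
import socket

# Product token from the user-agent string quoted above, matched case-insensitively.
UA_PATTERN = re.compile(r"\bsosospider/\d", re.IGNORECASE)

# Domain suffixes taken from the reverse DNS observation above.
TRUSTED_SUFFIXES = (".soso.com", ".tencent.com")


def looks_like_sosospider(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be Sosospider."""
    return bool(UA_PATTERN.search(user_agent or ""))


def hostname_is_trusted(hostname: str) -> bool:
    """Return True if the hostname ends with one of the expected suffixes."""
    return hostname.lower().rstrip(".").endswith(TRUSTED_SUFFIXES)


def verify_ip(ip: str) -> bool:
    """Forward-confirmed reverse DNS (IPv4 only in this sketch).

    Resolve the IP to a hostname, check the suffix, then resolve the
    hostname back and require that the original IP is among the results.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname_is_trusted(hostname):
            return False
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    return ip in addresses
```

A request whose header passes looks_like_sosospider but whose address fails verify_ip is a candidate for closer inspection, since the user-agent string alone is trivial to forge.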

📊 Data Usage

Collected web pages are indexed for Tencent's Soso Search engine and are also used to train machine learning models for Chinese-language natural language understanding tasks. Content summaries and snippets may appear in Tencent News feeds. Full text is stored temporarily for relevance ranking and deleted after 30 days, according to Tencent's privacy policy.

⚙️ Rate Limiting Policy

Because Sosospider can exhibit aggressive re-crawl patterns during index refreshes and originates from shared IP pools, rate limiting is recommended to protect server resources. A threshold-based block after 50 requests per minute with a temporary ban of 15 minutes provides a reasonable defense against accidental overload while allowing legitimate indexing.
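The threshold described above (50 requests per minute, 15-minute ban) can be sketched as a small in-memory limiter. The class name, the choice to key on a client string such as the IP address, and the sliding-window bookkeeping are assumptions made for illustration; a production setup would more likely enforce this in the web server, CDN, or firewall.

```python
import time
from collections import defaultdict, deque


class ThresholdBan:
    """Ban a client for ban_seconds once it exceeds limit requests per window."""

    def __init__(self, limit=50, window=60.0, ban_seconds=900.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.ban_seconds = ban_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._hits = defaultdict(deque)  # client -> timestamps of recent requests
        self._banned_until = {}  # client -> time at which the ban expires

    def allow(self, client: str) -> bool:
        now = self.clock()

        # Reject immediately while a ban is active; lift it once expired.
        until = self._banned_until.get(client)
        if until is not None:
            if now < until:
                return False
            del self._banned_until[client]

        # Discard timestamps that have fallen out of the sliding window.
        hits = self._hits[client]
        while hits and now - hits[0] >= self.window:
            hits.popleft()

        # The request that would exceed the limit triggers the ban.
        if len(hits) >= self.limit:
            self._banned_until[client] = now + self.ban_seconds
            hits.clear()
            return False

        hits.append(now)
        return True
```

Calling allow(client) once per incoming request returns False for the request that crosses the threshold and for every request during the ban, which maps naturally to an HTTP 429 response. Because the crawler is said to use shared IP pools, keying on the IP alone may also throttle unrelated traffic from the same addresses; combining the IP with the user-agent token is one possible refinement.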

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.