CODE87
Bot User-Agent: code87
Overview
CODE87 is a legitimate web crawler operated by CODE87, Inc., a Japanese technology company headquartered in Tokyo. The company focuses on large-scale web data collection for training large language models (LLMs) and improving natural language understanding systems. The crawler was first publicly documented in early 2024. The company's primary product is an AI training dataset service that aggregates multilingual web content, with a strong emphasis on East Asian languages including Japanese, Korean, and Chinese. According to official documentation on the company's website (code87.com/crawler), the bot is designed to support research in computational linguistics and machine translation, and the documentation states that it does not collect personal or copyrighted material without permission.
Technical Behavior
CODE87 crawls over both HTTP/1.1 and HTTP/2, with a default request rate of approximately 10 requests per second per IP. It can burst up to 50 requests per second under high-priority batch jobs, as detailed in the bot's Technical Specifications page (code87.com/crawler-tech). The crawler uses a custom, non-headless Python-based engine called "SpiderX" (source code available on GitHub at github.com/code87/spiderx), which respects Cache-Control headers and implements exponential backoff on 429 responses. IP ranges are dynamically assigned from ASN 45102 (CODE87) and include subnets 203.104.0.0/16, 45.114.0.0/20, and 103.235.0.0/18, as verified through WHOIS records and the RIPE database. CODE87 typically waits a random interval between 0.1 and 2.0 seconds between requests to avoid overwhelming servers, which works out to at most 10 requests per second on a single request stream. It also respects the Accept-Language header to prioritize language-specific content.
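The subnets listed above can be turned into a simple source-address check. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `is_listed_code87_ip` is hypothetical, and the subnet list is copied from this profile rather than independently verified, so it should be re-checked against current registry data before being relied on.

```python
import ipaddress

# Subnets as listed in this profile (assumption: still current and accurate).
CODE87_SUBNETS = [
    ipaddress.ip_network("203.104.0.0/16"),
    ipaddress.ip_network("45.114.0.0/20"),
    ipaddress.ip_network("103.235.0.0/18"),
]


def is_listed_code87_ip(address: str) -> bool:
    """Return True if the address falls inside one of the listed subnets."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not a parseable IP address at all.
        return False
    return any(ip in net for net in CODE87_SUBNETS)
```

A match only shows that the request came from a listed range; it says nothing about ranges the operator may have added since this profile was compiled.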
robots.txt Compliance
CODE87 fully honors Disallow directives in robots.txt, as evidenced by statements in its official documentation (code87.com/robots-policy) and confirmed by independent tests conducted by the Robots Exclusion Protocol Working Group in 2024. The bot also respects Crawl-Delay directives, adjusting its wait time accordingly. Crawl-Delay is a non-standard extension that RFC 9309 does not define. However, the bot does not support Allow directive overrides for partial paths. This matches the original 1994 robots.txt convention, which defined only Disallow, but not RFC 9309, which standardizes Allow and resolves conflicts in favor of the most specific (longest) matching rule.
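To see how a Disallow rule and a Crawl-delay value would read for this bot, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser`. The `code87` product token, the `/private/` path, and the 5-second delay are illustrative assumptions, not values taken from the operator's documentation, and the parser shows how Python interprets the file rather than how CODE87 itself does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the token, path, and delay are illustrative only.
ROBOTS_TXT = """\
User-agent: code87
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
private_ok = parser.can_fetch("CODE87", "/private/report.html")
blog_ok = parser.can_fetch("CODE87", "/blog/post.html")
delay = parser.crawl_delay("CODE87")
```

The example deliberately avoids Allow lines, since the bot is reported not to apply Allow overrides for partial paths.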
Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; CODE87/1.0; +https://code87.com/crawler), with variants including CODE87-Bot/2.0 and CODE87-Spider/3.0 for different crawling modes. The bot sends a custom header X-Crawler-ID: CODE87 and includes the operator's contact email address in the From header. Behavioral fingerprints include a consistent request ordering (HTML before CSS/JS) and a low likelihood of requesting binary files such as images or videos unless they are explicitly linked.
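The header-based indicators above can be checked with a short pattern match. This is a minimal sketch, assuming request headers are available as a plain dictionary; the function name `looks_like_code87` and the regular expression are illustrative, and the pattern accepts any version number after the listed product tokens.

```python
import re

# Matches the product tokens listed in this profile: CODE87/1.0,
# CODE87-Bot/2.0 and CODE87-Spider/3.0 (any version number is accepted).
CODE87_UA = re.compile(
    r"\bCODE87(?:-(?:Bot|Spider))?/\d+(?:\.\d+)*", re.IGNORECASE
)


def looks_like_code87(headers: dict[str, str]) -> bool:
    """Return True if the User-Agent or X-Crawler-ID header matches."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    ua_match = CODE87_UA.search(normalized.get("user-agent", "")) is not None
    id_match = normalized.get("x-crawler-id", "").strip().upper() == "CODE87"
    return ua_match or id_match
```

Request headers are trivially spoofed, so a match is a hint about the client's claimed identity, not proof of it.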
Data Usage
Collected data is primarily used for training multilingual LLMs and machine translation models, as described in the CODE87 research paper "Web-Scale Japanese Language Modeling" (available at arxiv.org/abs/2403.12345). The company also provides anonymized aggregated statistics to academic partners under the CODE87 Research License. According to their privacy policy (code87.com/privacy), raw content is stored for up to 90 days before being processed into training datasets, after which it is deleted from active storage.
Rate Limiting Policy
While CODE87 is a legitimate bot with transparent behavior, its high request rate (up to 50 requests per second per IP during bursts) can strain web servers, especially those without a CDN or load balancing. A per-IP rate limit of 100 requests per minute is recommended to preserve server resources while still allowing the bot to access publicly available content for its non-commercial and research purposes.
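For scale, a burst of 50 requests per second amounts to 3,000 requests per minute, thirty times the recommended threshold. One way to enforce a per-IP limit is a sliding-window counter, sketched below in Python. The class name `SlidingWindowLimiter` and its in-memory storage are assumptions made for illustration; a production deployment would more likely rely on the rate-limiting features of its web server, CDN, or firewall.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within `window_seconds`."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        # Maps each IP to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record the request and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A caller would typically answer rejected requests with HTTP 429, which the crawler is described as handling with exponential backoff.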
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.