nerima-crawl
Crawler User-Agent: nerima-crawl
🤖 Overview
nerima-crawl is a web crawler operated by Nerima Research Inc., a Japanese data analytics company. Its primary purpose is to collect publicly accessible web content for training large language models specialized in East Asian languages, particularly Japanese, Korean, and Chinese. The data also feeds into the proprietary "Nerima Search" academic search engine, which indexes technical and scientific content from the Asia-Pacific region. Official documentation at nerima.ai describes the crawler as a legitimate, rate-limited agent first deployed in January 2023.
🌐 Technical Behavior
The crawler speaks both HTTP/1.1 and HTTP/2. It averages 0.5 requests per second per IP, with bursts of up to 5 requests per second. It crawls from the IP ranges 203.0.113.0/24 and 198.51.100.0/24, as published at nerima.ai/ips.txt. It respects the nosnippet and noarchive robots meta directives when present. The User-Agent string is "Mozilla/5.0 (compatible; nerima-crawl/1.0; +https://nerima.ai/bot/crawl-info)", and each request includes a custom X-Nerima-Crawl-Version header. A randomized crawl delay of 1 to 3 seconds is applied between requests unless robots.txt sets a specific Crawl-Delay.
📋 robots.txt Compliance
According to the official robots.txt policy at nerima.ai/robots.txt, the crawler strictly honors Disallow directives and respects the Crawl-Delay field. It fetches robots.txt at the start of each crawl session and re-fetches every 6 hours. Public server logs from several websites confirm compliance; no intentional violations have been reported in security advisories.
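As an illustration of how Disallow and Crawl-Delay rules aimed at this crawler are interpreted, the sketch below parses a small robots.txt with Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the 10-second delay are hypothetical values chosen for the example; they are not taken from Nerima's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (paths and delay are illustrative).
ROBOTS_TXT = """\
User-agent: nerima-crawl
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Query with the crawler's product token rather than the full User-Agent string.
blocked = parser.can_fetch("nerima-crawl", "https://example.com/private/report.html")
allowed = parser.can_fetch("nerima-crawl", "https://example.com/articles/1")
delay = parser.crawl_delay("nerima-crawl")

print(blocked, allowed, delay)
```

A compliant crawler would skip the first URL, fetch the second, and wait the stated delay between requests instead of its default randomized delay.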
🔍 Detection Indicators
Primary User-Agent: "Mozilla/5.0 (compatible; nerima-crawl/1.0; +https://nerima.ai/bot/crawl-info)". Alternative strings include "nerima-crawl" and "NerimaBot/1.0". The From header carries a contact email address, which appears only in obfuscated form ("[email protected]") in the source. All requests originate from the announced IP pool, and the presence of the X-Nerima-Crawl-Version header is a strong behavioral fingerprint.
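The indicators above can be combined into a simple request check. The following is a minimal sketch, assuming the User-Agent tokens, IP ranges, and header name listed on this page; the function name classify_request and its return labels are hypothetical. Both listed ranges fall within address blocks that RFC 5737 reserves for documentation, so confirm the live list at nerima.ai/ips.txt before relying on them.

```python
import ipaddress

# Values as listed on this page; verify against nerima.ai/ips.txt before use.
ANNOUNCED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
UA_TOKENS = ("nerima-crawl", "nerimabot")
VERSION_HEADER = "x-nerima-crawl-version"


def classify_request(remote_ip: str, headers: dict[str, str]) -> str:
    """Return a hypothetical label describing how well a request matches nerima-crawl."""
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "").lower()

    # Requests that do not claim to be the crawler are out of scope.
    if not any(token in user_agent for token in UA_TOKENS):
        return "other"

    # A claimed User-Agent from outside the announced pool suggests spoofing.
    address = ipaddress.ip_address(remote_ip)
    if not any(address in network for network in ANNOUNCED_RANGES):
        return "suspect-spoof"

    # The custom version header is the remaining indicator.
    return "consistent" if VERSION_HEADER in lowered else "partial"


example = classify_request(
    "203.0.113.10",
    {
        "User-Agent": "Mozilla/5.0 (compatible; nerima-crawl/1.0; +https://nerima.ai/bot/crawl-info)",
        "X-Nerima-Crawl-Version": "1.0",
    },
)
print(example)
```

Checking the source IP before trusting the User-Agent matters because the User-Agent string and custom headers are trivially forged, while the source address of an established connection is much harder to fake.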
📊 Data Usage
Collected web pages are processed and stored for two purposes: training Nerima's language models for machine translation and text summarization, and building the Nerima Search index, which covers over 100 million academic pages. Personally identifiable information is removed during preprocessing, and data is never sold or shared with third parties, as stated in the privacy policy at nerima.ai/privacy.
⚙️ Rate Limiting Policy
Because nerima-crawl can generate significant traffic during large-scale indexing campaigns, many web administrators rate-limit it. The recommended threshold is 100 requests per minute per IP address; an IP that exceeds it receives a temporary 24-hour block. This policy keeps server resource usage fair while still allowing the crawler to perform its legitimate data collection.
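One way to enforce such a threshold is a per-IP sliding-window counter with a temporary block. The sketch below is an illustrative in-memory version, not a production implementation; the class name SlidingWindowLimiter and its parameters are hypothetical, with defaults set to the 100 requests per minute and 24-hour block described above. The usage lines shrink the limits and inject a fake clock so the behavior is easy to follow.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-IP sliding-window limiter with a temporary block (illustrative sketch)."""

    def __init__(self, limit=100, window=60.0, block_seconds=24 * 3600, clock=time.monotonic):
        self.limit = limit                  # max requests allowed per window
        self.window = window                # window length in seconds
        self.block_seconds = block_seconds  # block duration once the limit is exceeded
        self._clock = clock
        self._hits = defaultdict(deque)     # ip -> timestamps of recent requests
        self._blocked_until = {}            # ip -> time at which the block expires

    def allow(self, ip: str) -> bool:
        now = self._clock()

        # Reject while a block is active; lift it once it has expired.
        until = self._blocked_until.get(ip)
        if until is not None:
            if now < until:
                return False
            del self._blocked_until[ip]

        # Drop timestamps that have fallen out of the window, then record this hit.
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)

        # Exceeding the limit starts a block and resets the counter.
        if len(hits) > self.limit:
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True


# Usage with small limits and a fake clock so the example runs instantly.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=3, window=60.0, block_seconds=100.0, clock=lambda: fake_now[0])
results = [limiter.allow("203.0.113.7") for _ in range(4)]
print(results)  # [True, True, True, False]
```

In a multi-process deployment the counters would need to live in shared storage; the dictionary here is only suitable for a single process.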
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.