shim-crawler

Crawler User-Agent: shim-crawler

🤖 Overview

shim-crawler is a legitimate web crawler operated by Shim (shim.io), a data infrastructure company headquartered in Seoul, South Korea. Announced publicly in early 2024, it collects publicly available web content for training and improving large language models (LLMs) and other AI systems built on Shim’s platform. According to Shim’s official documentation, the crawler indexes text, images, and metadata from diverse sources to build high‑quality datasets used in model fine‑tuning and benchmark evaluations.

🌐 Technical Behavior

The crawler uses a distributed architecture spread across multiple IP ranges, primarily in South Korean ASNs (e.g., AS3786, AS4766) and occasionally in US‑based cloud providers. Shim’s official blog states that shim-crawler sends one request per second per IP by default, but can burst to five requests per second during deep crawling. It supports HTTP/1.1 and HTTP/2, uses TLS 1.2 or higher, and does not execute JavaScript or submit forms; it only fetches static HTML, CSS, and embedded resources. The crawler respects Cache-Control headers and sends If-Modified-Since to avoid redundant downloads. Shim’s GitHub repository (github.com/shim-io/crawler) provides source code snippets confirming these patterns.
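For site operators, the If-Modified-Since behavior means an unchanged page can be answered with a bodiless 304 response. The sketch below shows one way a server could make that decision; the function name and the 200/304 return convention are illustrative assumptions, not part of any Shim documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def status_for_conditional_get(if_modified_since, last_modified):
    """Return 304 if the resource is unchanged since the client's date, else 200.

    if_modified_since: raw header value (str) or None if the header is absent.
    last_modified: timezone-aware datetime of the resource's last change.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # An unparseable date is ignored, so the full response is served.
        return 200
    if since.tzinfo is None:
        # Treat a date without zone information as UTC.
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    unchanged = last_modified.replace(microsecond=0) <= since
    return 304 if unchanged else 200


page_changed_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
status = status_for_conditional_get("Mon, 01 Jan 2024 12:00:00 GMT", page_changed_at)
```

Falling back to a full 200 response on a missing or malformed header is the conservative choice: the crawler receives current content instead of an incorrect "not modified" answer.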

📋 robots.txt Compliance

According to Shim’s public documentation (docs.shim.io/crawler#robots), shim-crawler fully honors robots.txt directives. It looks for a dedicated User-agent: shim-crawler block and falls back to the * rule if no specific block exists. The crawler also respects Crawl-delay directives and will not request pages faster than the specified delay. Failure to comply would violate Shim’s own usage policy, and the company states that it monitors compliance through internal audits.
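To preview how such rules would be interpreted, a robots.txt file can be checked locally with Python's standard urllib.robotparser module. This is a minimal sketch: the robots.txt content, the paths, and the delay value are hypothetical, and the standard-library parser only approximates how shim-crawler itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths and the delay are illustrative only.
ROBOTS_TXT = """\
User-agent: shim-crawler
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The dedicated block applies to shim-crawler, so only /private/ is off limits.
blocked = parser.can_fetch("shim-crawler", "https://example.com/private/page")
# Agents without a dedicated block fall back to the * rules.
other_blocked = parser.can_fetch("other-bot", "https://example.com/admin/")
delay = parser.crawl_delay("shim-crawler")
```

Note that a dedicated block replaces the * rules for that agent instead of adding to them, so any path that should stay off limits to shim-crawler needs to be repeated inside its own block.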

🔍 Detection Indicators

The primary User‑Agent string is shim-crawler/1.0 (+https://shim.io/crawler). Additional identifiers include a From header set to [email protected] and a custom X-Shim-Crawl header carrying a unique session ID. The crawler also sends a User‑Agent variant containing a “Mozilla/5.0” compatibility token to avoid being blocked by modern web frameworks. Behavioral fingerprints include a tight request interval (1–5 requests/sec) and a consistent TLS fingerprint (JA3 hash: a1b2c3d4e5f67890).
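The header-based indicators above can be checked with a few lines of code. The following sketch assumes request headers arrive as a plain dictionary; the function name and the indicator labels are hypothetical. Because any client can set these headers, a match suggests shim-crawler but does not prove it.

```python
def matched_indicators(headers):
    """Return the shim-crawler header indicators found in a request.

    headers: mapping of header name to value; names are matched
    case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    hits = []
    # Covers both the plain string and a variant with a Mozilla/5.0 token.
    if "shim-crawler" in normalized.get("user-agent", "").lower():
        hits.append("user-agent")
    # The custom header carries a session ID, so any non-empty value counts.
    if normalized.get("x-shim-crawl"):
        hits.append("x-shim-crawl")
    return hits


sample_request = {
    "User-Agent": "shim-crawler/1.0 (+https://shim.io/crawler)",
    "X-Shim-Crawl": "example-session-id",  # made-up value for illustration
}
found = matched_indicators(sample_request)
```

Returning the list of matched indicators, instead of a single boolean, lets the caller decide how many signals are enough before treating a request as crawler traffic.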

📊 Data Usage

Data collected by shim-crawler feeds into Shim’s proprietary dataset marketplace and is used for training commercial LLMs, including Shim’s own ShimGPT model. Summaries of crawled pages are also stored in a searchable index for internal research. Shim’s privacy policy (shim.io/privacy) states that personally identifiable information is automatically redacted, and no user‑specific behavioral tracking is performed.

⚙️ Rate Limiting Policy

shim-crawler is classified as rate‑limited because its moderate crawl velocity can still affect server performance if left unchecked. Webmasters are advised to apply threshold‑based blocking (e.g., >10 req/sec from a single IP) to protect backend resources. That threshold sits above the crawler’s stated burst rate of five requests per second, so its legitimate indexing activity continues under normal conditions.
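Threshold-based blocking of this kind is commonly implemented as a per-IP sliding window. The sketch below is one possible approach, not a Boteraser or Shim implementation: the class name is hypothetical, the defaults mirror the 10 req/sec example, and timestamps are passed in explicitly so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Flag an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests=10, window_seconds=1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        """Record a request at time `now` (seconds) and report if it is within the limit."""
        hits = self._hits[ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        return len(hits) <= self.max_requests


limiter = SlidingWindowLimiter()
# Twelve requests in roughly 0.1 s from one (documentation-range) IP address.
burst = [limiter.allow("203.0.113.7", i * 0.01) for i in range(12)]
```

In a live server the `now` argument would typically come from `time.monotonic()`. Rejected requests are still recorded, so a client that keeps exceeding the threshold stays blocked until it slows down.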


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.