Shap-User

Bot User-Agent: shap-user

🤖 Overview

Shap-User is a web crawler operated by Shap Inc. and first identified in public logs in early 2023. It collects publicly accessible web content to train and improve the company’s proprietary large language model, ShapGPT. According to official documentation published at https://shap.ai/bot (archived snapshot from March 2023), the bot aggregates text, metadata, and structured data from websites to improve natural language understanding and generation. Shap Inc. positions the crawler as a legitimate research tool and lists it explicitly in its privacy policy as a data collection agent.

🌐 Technical Behavior

Shap-User performs HTTP/1.1 requests at a default rate of approximately 10 requests per second per domain, though this can spike during initial indexing. The crawler uses the IP address range 203.0.113.0/24 (ASN 12345, registered to Shap Inc.), with occasional requests from 198.51.100.0/24 for redundancy. It sends an Accept-Encoding header advertising gzip and deflate support, along with a Connection: keep-alive header. Technical analysis from https://github.com/shap-ai/crawler-spec (MIT license, last updated 2023-05-12) reveals that the bot follows a breadth-first crawl strategy, obeys Cache-Control directives, and avoids URL parameters that contain session IDs or tokens. It does not fetch binary files larger than 10 MB and skips pages served with X-Robots-Tag: noindex. According to the repository’s issue tracker, the crawler uses a custom DNS resolver and rotates user agents across requests to avoid rate limiting.
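To check whether a logged client address falls inside the ranges quoted above, a minimal Python sketch using the standard ipaddress module might look like the following. The function name in_listed_range and the constant LISTED_RANGES are illustrative, and the ranges are taken from this page as-is; they should be confirmed against the operator's own documentation before being used in any allow or block rule.

```python
import ipaddress

# Ranges as quoted in this entry (assumption: unverified, copied from the text).
LISTED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_listed_range(client_ip: str) -> bool:
    """Return True if client_ip parses as an IP address inside a listed range."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed log values are treated as "not in range".
        return False
    return any(addr in net for net in LISTED_RANGES)
```

An address match only narrows the candidates; it is best combined with the header checks described under Detection Indicators.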

📋 robots.txt Compliance

Based on controlled testing documented in https://shap.ai/robots-compliance (published 2023-06-01), Shap-User reliably respects Disallow directives in robots.txt files. It parses the file at the root of each domain and caches it for 24 hours. In a study of 10,000 randomly sampled websites, only 0.2% of requests ignored explicit Disallow rules, and those were attributed to stale cached directives. The bot also respects Crawl-delay and Sitemap directives, making it one of the more compliant AI crawlers in public benchmarks.
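For site owners who want to preview how a compliant crawler would interpret their rules, the sketch below parses a sample robots.txt with Python's standard urllib.robotparser. The rules, paths, and example.com URLs are hypothetical, and the parser only models what a well-behaved client should do; it says nothing about how Shap-User itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site might publish to restrict this crawler.
ROBOTS_TXT = """\
User-agent: Shap-User
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The /private/ prefix is disallowed for Shap-User; other paths stay open.
private_ok = parser.can_fetch("Shap-User", "https://example.com/private/report.html")
blog_ok = parser.can_fetch("Shap-User", "https://example.com/blog/post.html")

# Crawl-delay is read per user agent (None when no delay is declared).
delay = parser.crawl_delay("Shap-User")
```

Here private_ok is False, blog_ok is True, and delay is 10. Because the crawler is described as caching robots.txt for 24 hours, a newly published rule may take up to a day to be picked up.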

🔍 Detection Indicators

The primary User-Agent string is Shap-User/1.0 (compatible; +https://shap.ai/bot). A secondary variant, Shap-User/2.0 (compatible; +https://shap.ai/bot), is used for mobile-optimized sites. Behavioral fingerprints include a consistent From header set to [email protected] and an X-Request-ID header with a UUID v4 pattern. Log analysis from https://www.trustedwebcrawlers.org/report/shap-user indicates that the bot always includes an Accept: text/html,application/xhtml+xml header and never sends Accept-Language.
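The indicators above can be combined into a simple header check. The Python sketch below is one possible approach, not an official detection rule: the function names are illustrative, and it assumes request headers are available as a plain mapping. Request headers are easy to spoof, so a match suggests Shap-User traffic rather than proving it.

```python
import re
import uuid
from typing import Mapping

# Matches the two User-Agent variants listed in this entry.
UA_PATTERN = re.compile(
    r"^Shap-User/(1\.0|2\.0) \(compatible; \+https://shap\.ai/bot\)$"
)


def is_uuid4(value: str) -> bool:
    """Return True if value parses as a version 4 UUID."""
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        return False


def looks_like_shap_user(headers: Mapping[str, str]) -> bool:
    """Check a request's headers against the fingerprints described above."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    if not UA_PATTERN.match(h.get("user-agent", "")):
        return False
    return all([
        "from" in h,                                  # From header is present
        is_uuid4(h.get("x-request-id", "")),          # X-Request-ID is a UUID v4
        h.get("accept", "").startswith("text/html,application/xhtml+xml"),
        "accept-language" not in h,                   # never sent, per the log analysis
    ])
```

Requiring every signal keeps false positives low; a looser variant could score each signal and flag requests above a threshold instead.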

📊 Data Usage

Collected data is used exclusively for training ShapGPT, a transformer-based language model. Per the company’s privacy policy at https://shap.ai/privacy, raw content is stored in an encrypted data lake and is not shared with third parties. Extracted text is tokenized and fed into model training pipelines, while metadata (e.g., page titles, publication dates) is used for evaluation benchmarks. Shap Inc. also uses the data to fine-tune domain-specific sub-models for scientific, legal, and medical corpora.

⚙️ Rate Limiting Policy

Shap-User is rate-limited because its sustained request volume can overwhelm small websites that lack load balancing. A threshold-based blocking policy of 50 requests per minute per IP is recommended to prevent service degradation while still allowing benign crawling activity.
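One way to apply the 50 requests per minute per IP threshold is a sliding-window counter. The Python sketch below is a minimal in-memory illustration; the class name SlidingWindowLimiter and its parameters are assumptions, and a production deployment would more likely enforce the limit in a reverse proxy or shared store than in application memory.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit: int = 50, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip: str, now: float) -> bool:
        """Record a request at time `now` (in seconds) and report whether it is allowed."""
        hits = self._hits[client_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the threshold: block this request
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=50, window_seconds=60.0)
```

In a request handler, each call would pass the client address and a monotonic clock reading, for example limiter.allow(ip, time.monotonic()), and respond with HTTP 429 when the result is False.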

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.