Craftbot
Bot User-Agent: craftbot
🤖 Overview
Craftbot is a web crawler operated by Craft AI, Inc. (craft.ai), a San Francisco-based company that provides an AI-powered knowledge platform combining search, document management, and generative AI. First publicly documented in early 2023, Craftbot indexes publicly accessible web pages to feed data into Craft’s product, which enables users to query internal and external content using natural language. The bot’s purpose is strictly legitimate: it collects textual and metadata information to improve the accuracy and breadth of Craft’s retrieval-augmented generation (RAG) system.
🌐 Technical Behavior
Craftbot uses a headless Chromium browser for crawling, executing JavaScript and rendering pages to capture dynamically loaded content. According to official documentation published at craft.ai/robots.txt, the crawler issues requests at a default rate of approximately 10 requests per second from a pool of IP addresses owned by Google Cloud Platform (GCP) and Amazon Web Services (AWS), with ranges such as 35.184.0.0/13 and 52.0.0.0/15. It follows HTTP redirects (301, 302) up to 5 hops and respects Cache-Control headers with max-age directives to avoid re-crawling fresh content prematurely. Craftbot identifies itself via the User-Agent header “Craftbot/1.0” and adds an Accept-Language header set to “en-US,en;q=0.9”. The crawler also includes a custom X-Crawler-Id header set to “craftbot” for easy identification by server administrators.
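The identifiers described above can be combined into a simple request check. The following is a minimal sketch, assuming the header values and the two example CIDR ranges quoted in this section are accurate. The function name `looks_like_craftbot` and the constant `CRAFTBOT_EXAMPLE_RANGES` are hypothetical, and the real IP pool is likely larger and may change over time.

```python
import ipaddress

# Assumption: only the two example ranges quoted in the text are listed here.
CRAFTBOT_EXAMPLE_RANGES = [
    ipaddress.ip_network("35.184.0.0/13"),
    ipaddress.ip_network("52.0.0.0/15"),
]


def looks_like_craftbot(headers, remote_ip):
    """Return True if headers and source IP match the description in the text."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v for k, v in headers.items()}
    ua_ok = "craftbot/" in h.get("user-agent", "").lower()
    id_ok = h.get("x-crawler-id", "").lower() == "craftbot"
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a parseable IP address, so it cannot match any range.
        return False
    ip_ok = any(ip in net for net in CRAFTBOT_EXAMPLE_RANGES)
    return ua_ok and id_ok and ip_ok
```

Headers are trivial to spoof, and the quoted ranges are shared cloud address space, so a match here is only a hint and not proof of identity.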
📋 robots.txt Compliance
Craftbot honors robots.txt directives, according to independent reports (e.g., Reddit r/devops threads and Cloudflare bot management logs). The official Craftbot documentation explicitly states that it reads and respects Disallow rules, including those for specific paths, and will not crawl pages blocked by a “Disallow: /” directive. There are no known reports of Craftbot ignoring robots.txt or circumventing crawl restrictions.
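To see how a compliant crawler would interpret such rules, a site owner can test a robots.txt file with Python's standard `urllib.robotparser`. This is a minimal sketch: the sample rules, the `example.com` URLs, and the helper name `craftbot_may_fetch` are illustrative assumptions, and the result shows how the standard parser reads the rules, not how Craftbot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Craftbot out of /private/, allow everything else.
SAMPLE_ROBOTS_TXT = """\
User-agent: Craftbot
Disallow: /private/

User-agent: *
Disallow:
"""


def craftbot_may_fetch(robots_txt, url):
    """Return True if the given rules allow the Craftbot user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Craftbot/1.0", url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/private/report.html"):
        print(url, craftbot_may_fetch(SAMPLE_ROBOTS_TXT, url))
```

Replacing the Craftbot group with a single `Disallow: /` line would block the whole site for that user agent.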
🔍 Detection Indicators
The primary detection string is User-Agent: Craftbot/1.0. Additional fingerprints include a consistent 100–200 ms delay between successive requests, HTTP/1.1 with keep-alive connections, and a Referer header that is sometimes set to “https://craft.ai/”. The bot’s IP addresses originate from GCP and AWS regions worldwide, but a reverse DNS lookup on the requesting IP (if enabled) will resolve to “crawler.craft.ai” or a similar subdomain. When the site sits behind a load balancer, the original crawler IP appears in the X-Forwarded-For header recorded in server logs.
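Because a User-Agent string can be forged, the reverse DNS indicator is usually checked with a forward-confirmed lookup: resolve the IP to a hostname, confirm the hostname falls under the expected domain, then resolve that hostname back and confirm it includes the original IP. The sketch below assumes the `craft.ai` suffix mentioned above. The function name `verify_crawler_ip` and its injectable `reverse` and `forward` parameters are hypothetical conveniences for testing, and the default forward lookup handles IPv4 only.

```python
import socket


def verify_crawler_ip(ip, suffix="craft.ai", reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    # Default resolvers use the system DNS; callers may inject fakes for tests.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: cannot verify.
        return False
    # Require the exact domain or a true subdomain, not just a string suffix.
    if hostname != suffix and not hostname.endswith("." + suffix):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

Matching on `"." + suffix` rather than a bare suffix avoids accepting look-alike hosts such as `evilcraft.ai`.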
📊 Data Usage
Craft AI states that Craftbot-collected data is used to build and update its proprietary vector index and knowledge graph, which power its AI assistant and search platform. The content is processed for semantic embedding and summarization; no raw page text is stored beyond what is needed for retrieval. According to the company, data from public web pages may also be used to fine-tune internal models for improved query understanding, but never for redistribution or public model training.
⚙️ Rate Limiting Policy
Because Craftbot can generate steady traffic over extended periods, websites may want to rate-limit it (e.g., 100 requests per minute per IP) to prevent resource exhaustion. Craftbot backs off exponentially when it receives 429 Too Many Requests responses, so threshold-based throttling is usually effective and a permanent block is rarely needed.
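A per-IP threshold like the one suggested above can be sketched with a sliding-window counter. This is an illustrative, in-memory example, not a production setup: the class name `SlidingWindowLimiter`, the helper `status_for`, and the injectable `clock` parameter are assumptions made for the sketch, and most sites would apply the same limit in their web server, CDN, or WAF.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each client IP."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip):
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def status_for(limiter, ip):
    """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
    return 200 if limiter.allow(ip) else 429
```

Returning 429 instead of dropping the connection gives a crawler that backs off, as Craftbot is described as doing, a clear signal to slow down.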
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.