TerraCotta

Bot User-Agent: terracotta

🤖 Overview

TerraCotta is a web crawler operated by Terra AI, a data curation company founded in 2023, designed to harvest publicly available web content for training large language models and other AI systems. Its primary output feeds into Terra’s proprietary dataset pipeline, which is licensed to enterprise AI developers. The crawler was first publicly documented in a February 2024 blog post on Terra’s official site.

🌐 Technical Behavior

TerraCotta uses a distributed crawling architecture that issues requests from a pool of IP addresses in ranges announced by AS397167 (Terra AI) and AS20473 (Vultr), rotating through /24 subnets. It fetches pages at a rate of approximately 10 requests per second per IP, but can scale to 50 requests per second across the pool. The crawler supports both HTTP/1.1 and HTTP/2, and uses the ETag and Last-Modified validators to avoid re-downloading unchanged content. It also sends a From header pointing to [email protected]. According to Terra’s GitHub repository (terraform-ai/crawler), the crawler implements polite crawling with a configurable delay and a default Crawl-Delay of 5 seconds as defined in robots.txt.
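For site operators, the ETag and Last-Modified behavior only saves bandwidth if the server answers conditional requests. The following is a minimal server-side sketch of that decision, assuming the crawler sends the standard If-None-Match and If-Modified-Since request headers. The function name conditional_status and the sample values are hypothetical and are not taken from Terra's documentation.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag: str) -> str:
    """Drop the weak-validator prefix so that W/"x" compares equal to "x"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    Assumes header names use their canonical capitalization and that
    last_modified is a timezone-aware datetime.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = {_strip_weak(t) for t in if_none_match.split(",")}
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
            if last_modified <= since:
                return 304
        except (TypeError, ValueError):
            # Unparseable or timezone-less date: fall back to a full response.
            return 200
    return 200


# Illustrative resource state (hypothetical values).
ETAG = '"abc123"'
LAST_MODIFIED = datetime(2024, 2, 1, 12, 0, tzinfo=timezone.utc)

unchanged = conditional_status({"If-None-Match": '"abc123"'}, ETAG, LAST_MODIFIED)
changed = conditional_status({"If-None-Match": '"old"'}, ETAG, LAST_MODIFIED)
```

A 304 response carries no body, so a crawler that revalidates in this way costs far less bandwidth than one that downloads every page again.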

📋 robots.txt Compliance

TerraCotta strictly adheres to the Robots Exclusion Protocol, reading and obeying Disallow directives before each crawl session. Terra’s official documentation (terra.ai/robots) confirms that the crawler will not access any path blocked by robots.txt, and it also respects Crawl-Delay values. There are no documented cases of TerraCotta ignoring robots.txt rules; the company’s transparency report lists compliance as a key operational metric.
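To check how a compliant crawler would interpret a given rule set, a robots.txt file can be evaluated locally with Python's standard urllib.robotparser module. The sketch below assumes the crawler matches the terracotta token shown at the top of this page. The robots.txt content, the example.com paths, and the 10-second delay are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site; paths and delay are illustrative.
ROBOTS_TXT = """\
User-agent: terracotta
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A path under /private/ is disallowed for the terracotta group.
blocked = parser.can_fetch("terracotta", "https://example.com/private/report.html")

# Any other path is still allowed for the same group.
allowed = parser.can_fetch("terracotta", "https://example.com/blog/post.html")

# The Crawl-delay declared for this group, in seconds.
delay = parser.crawl_delay("terracotta")

print(blocked, allowed, delay)
```

This only shows what the rules say. Whether a particular crawler honors them has to be confirmed from server logs.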

🔍 Detection Indicators

The default User-Agent string is Mozilla/5.0 (compatible; TerraCotta/1.0; +https://terra.ai/crawler), with a second variant TerraCotta/1.0 (bot; +https://terra.ai/crawler) appearing in some logs. Additional fingerprints include a fixed Accept-Encoding: gzip, deflate header and a Connection: keep-alive header with no accompanying Upgrade-Insecure-Requests header. TerraCotta does not spoof browser headers; it always identifies itself as a bot via the User-Agent header.
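The indicators above can be turned into a simple log or middleware check. The sketch below matches the TerraCotta token in the User-Agent header and counts the secondary header fingerprints. The function names are hypothetical, and because headers are trivial to forge, a match should be treated as a hint rather than proof of origin.

```python
import re

# Matches the product token found in both User-Agent variants quoted above.
TERRACOTTA_UA = re.compile(r"\bTerraCotta/\d+(?:\.\d+)*", re.IGNORECASE)


def _normalize(headers: dict) -> dict:
    """Lower-case header names and values for case-insensitive comparison."""
    return {k.lower(): v.strip().lower() for k, v in headers.items()}


def looks_like_terracotta(headers: dict) -> bool:
    """True if the User-Agent header carries a TerraCotta product token."""
    return bool(TERRACOTTA_UA.search(_normalize(headers).get("user-agent", "")))


def secondary_signal_count(headers: dict) -> int:
    """Count how many of the three secondary fingerprints are present (0 to 3)."""
    h = _normalize(headers)
    signals = [
        h.get("accept-encoding") == "gzip, deflate",
        h.get("connection") == "keep-alive",
        "upgrade-insecure-requests" not in h,
    ]
    return sum(signals)


# Example request headers as they might appear in a server log.
bot_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; TerraCotta/1.0; +https://terra.ai/crawler)",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

print(looks_like_terracotta(bot_headers), secondary_signal_count(bot_headers))
print(looks_like_terracotta(browser_headers), secondary_signal_count(browser_headers))
```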

📊 Data Usage

The collected data is used exclusively for AI training – specifically to create curated, deduplicated text corpora for supervised fine-tuning and reinforcement learning from human feedback (RLHF). Terra AI also uses the data to train its own foundation models, which are later released under permissive licenses on Hugging Face (huggingface.co/terra-models). No search indexing or analytics services are provided.

⚙️ Rate Limiting Policy

While TerraCotta is a legitimate and well-behaved crawler, it may still overwhelm small web servers if left unthrottled, because its aggregate request rate can reach 50 req/s across distributed nodes. Rate limiting is recommended to protect site performance while still allowing data collection; a threshold of 5 requests per second per IP is a common safe policy documented in community forums.
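A per-IP threshold like the one above can be enforced with a sliding-window counter. The sketch below is one possible approach, not a recommended production setup: the class name SlidingWindowLimiter is hypothetical, state is kept in memory for a single process, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # IP -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit; the caller would answer HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)

# Seven requests from one address (a documentation-range IP), 100 ms apart.
burst = [limiter.allow("203.0.113.7", i * 0.1) for i in range(7)]

# One second after the first request, the oldest hit has expired.
after_pause = limiter.allow("203.0.113.7", 1.0)

print(burst, after_pause)
```

Because the crawler rotates across many addresses, a per-IP limit caps each node individually but does not by itself bound the aggregate rate across the pool.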

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.