Re-re
Bot User-Agent: re-re
🤖 Overview
Re-re is a web crawler operated by Re-re AI (re-re.ai) and first documented in early 2024. It collects publicly available text and metadata to train the company’s proprietary large language models and to improve its AI-powered search and summarization products. The bot is listed in the company’s official user-agent registry and is considered a legitimate, non-malicious agent built for large-scale data acquisition.
🌐 Technical Behavior
Re-re crawls predominantly over HTTP/1.1 and HTTP/2. Its crawl rate is configurable: the default is 3 requests per second per IP, rising to as many as 10 requests per second in burst mode. The bot fetches both standard web pages and API endpoints that robots.txt allows, and it selectively follows URLs listed in sitemap.xml. Its source IPs fall within blocks announced by Re-re AI’s autonomous system, primarily 203.0.113.0/24 and 198.51.100.0/24 (documented in the public IP list at re-re.ai/ips). According to the official documentation, each request carries an Accept-Language: en-US,en;q=0.9 header and a From: crawler@re-re.ai header for contact.
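A request's source address can be checked against the two ranges quoted above with a few lines of Python. This is only a sketch: the function name `is_in_published_range` is hypothetical, and the ranges are copied from this page rather than fetched from the operator's list. Note also that both ranges coincide with blocks reserved for documentation in RFC 5737 (TEST-NET-3 and TEST-NET-2), so they should be treated as placeholders until confirmed against re-re.ai/ips.

```python
import ipaddress

# Ranges as quoted on this page (assumption: confirm against the
# operator's published list before relying on them).
PUBLISHED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_in_published_range(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the listed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed addresses are never treated as a match.
        return False
    return any(addr in net for net in PUBLISHED_RANGES)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "192.0.2.1", "not-an-ip"):
        print(candidate, is_in_published_range(candidate))
```

An IP match on its own only shows where a request came from; it is most useful when combined with the header indicators the bot is documented to send.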
📋 robots.txt Compliance
According to Re-re AI’s published guidelines, the Re-re bot fully honours the Disallow and Crawl-delay directives in robots.txt. A 2024 audit of the top 1,000 sites found no requests that violated an explicit exclusion, and the bot respects Allow directives that re-open paths inside otherwise disallowed sections. The company provides a dedicated robots.txt helper page for webmasters at re-re.ai/robots.
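A robots.txt policy for this bot can be tested locally before it is deployed. The sketch below uses Python's standard `urllib.robotparser`; the product token `Re-reBot` is an assumption derived from the User-Agent string listed on this page, and the paths are hypothetical. The Allow line is placed before the broader Disallow line because Python's parser applies the first matching rule, and that ordering also gives the same result under longest-match evaluation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block /private/ except the press subfolder,
# and ask the bot to wait 5 seconds between requests.
ROBOTS_TXT = """\
User-agent: Re-reBot
Allow: /private/press/
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    agent = "Re-reBot/1.0"
    for path in ("/index.html", "/private/data.html", "/private/press/launch.html"):
        print(path, parser.can_fetch(agent, "https://example.com" + path))
    print("crawl delay:", parser.crawl_delay(agent))
```

This only shows how a standards-following parser reads the file; whether the bot behaves the same way is something server logs would have to confirm.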
🔍 Detection Indicators
The primary User-Agent string is Re-reBot/1.0; later releases change the version number (for example, 1.1). Additional identifiers include the comment field (+https://re-re.ai/bot) and a custom request header, X-Re-re-ID, which contains a unique crawl session token. When routing through proxy nodes, the bot also sends a Via header with the value Re-re-Crawler. These fingerprints are documented in the official Re-re AI crawler FAQ.
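The indicators listed above can be checked in one pass over a request's headers. The following is a minimal sketch, assuming the headers arrive as a plain mapping; the function name `matched_indicators` and the labels it returns are hypothetical, and the exact header formats are taken from this page rather than from a verified capture.

```python
import re
from typing import List, Mapping

# Matches "Re-reBot/<major>.<minor>", e.g. Re-reBot/1.0 or Re-reBot/1.1.
UA_PATTERN = re.compile(r"\bRe-reBot/\d+\.\d+\b")


def matched_indicators(headers: Mapping[str, str]) -> List[str]:
    """Return the names of the documented fingerprints found in headers."""
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    hits = []

    user_agent = normalized.get("user-agent", "")
    if UA_PATTERN.search(user_agent):
        hits.append("user-agent")
    if "+https://re-re.ai/bot" in user_agent:
        hits.append("ua-comment")
    if normalized.get("x-re-re-id"):
        hits.append("x-re-re-id")
    if "re-re-crawler" in normalized.get("via", "").lower():
        hits.append("via")
    return hits


if __name__ == "__main__":
    sample = {
        "User-Agent": "Re-reBot/1.1 (+https://re-re.ai/bot)",
        "X-Re-re-ID": "session-token-placeholder",
    }
    print(matched_indicators(sample))
```

Request headers are trivially spoofed, so a match here is best treated as a hint and paired with a check of the source IP.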
📊 Data Usage
Collected data is used exclusively within Re-re AI. It feeds the training of the company’s language models, both base-model pre-training and fine-tuning for domain-specific tasks, and it is also ingested to improve the real-time question-answering system and to generate training examples for reinforcement learning from human feedback (RLHF) pipelines. No raw content is resold; all derivatives are used internally.
⚙️ Rate Limiting Policy
Because Re-re can temporarily saturate server resources during its burst phases, a standard rate-limiting threshold (e.g., 100 requests per minute per IP) is recommended to protect origin servers. The rationale is that fair-usage tiers preserve website stability while still allowing the bot to collect the broad data it needs for AI model improvement.
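To see what such a threshold means in practice: the documented default of 3 requests per second works out to 180 requests per minute, and a burst of 10 per second to 600, so a limit of 100 per minute per IP would throttle the bot even at its default rate. The sketch below shows one common way to enforce a per-IP limit, a sliding-window counter. The class name `SlidingWindowLimiter` is hypothetical, the state is kept in memory for a single process, and a production deployment would more likely use the rate-limiting features of its web server or CDN.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_seconds` span."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        # `now` can be injected for testing; otherwise use a monotonic clock.
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=100, window_seconds=60.0)
    # Simulate one minute of traffic at the default 3 requests per second.
    allowed = sum(limiter.allow("203.0.113.7", now=i / 3) for i in range(180))
    print(f"allowed {allowed} of 180 requests")
```

Rejected requests would typically receive an HTTP 429 response, optionally with a Retry-After header, so that a well-behaved crawler can back off.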
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.