web-agent
Bot User-Agent: web-agent
🤖 Overview
web-agent is a legitimate web crawler operated by WebAgent Inc. (webagent.ai), a company headquartered in San Francisco, California, founded in 2022. The bot's purpose is to systematically retrieve publicly accessible web content — including text, metadata, and structured data — to fuel WebAgent's proprietary large language model training pipelines and its real‑time competitive intelligence analytics platform for enterprise clients. Official documentation is available at docs.webagent.ai/crawler.
🌐 Technical Behavior
web-agent employs a distributed, multi‑threaded crawling architecture issuing requests from IP ranges 104.16.0.0/12 (Cloudflare AS13335) and 198.41.0.0/16 (WebAgent's own ASN, AS60976). It uses HTTP/1.1 and HTTP/2 protocols, supports gzip/brotli compression, and enforces a maximum rate of 5 requests per second per domain, as per docs.webagent.ai/crawler. The crawler respects the Crawl‑Delay directive, uses exponential backoff on 429 responses, and typically visits each URL once per crawl cycle with a randomized delay between 200ms and 2 seconds.
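As a rough illustration of how a site operator might test whether a request's source address falls inside the ranges listed above, here is a minimal Python sketch using the standard `ipaddress` module. The names `LISTED_RANGES` and `in_listed_ranges` are hypothetical, and the ranges are copied from this page as unverified claims, so a match is only a hint and not proof of origin.

```python
import ipaddress

# CIDR ranges exactly as listed on this page (unverified; assumption).
LISTED_RANGES = [
    ipaddress.ip_network("104.16.0.0/12"),
    ipaddress.ip_network("198.41.0.0/16"),
]


def in_listed_ranges(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IP literal (e.g. a hostname or garbage in a log line).
        return False
    return any(ip in net for net in LISTED_RANGES)
```

Because one of the listed ranges is attributed to a shared CDN network, an address match alone is weak evidence and is best combined with the other indicators described on this page.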
📋 robots.txt Compliance
WebAgent's public documentation states that web-agent fully adheres to the robots.txt exclusion standard. It reads the file at the start of each crawl, caches it for 24 hours, and stops crawling any path (or the entire domain) covered by a Disallow directive. Site owners can request additional restrictions via a contact form at webagent.ai/contact.
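To see how a Disallow rule aimed at this crawler would be interpreted, the sketch below feeds a sample robots.txt to Python's standard `urllib.robotparser`. The file contents, the `/private/` path, and the 10-second Crawl-delay are illustrative assumptions, not values taken from WebAgent's documentation, and the standard-library parser only approximates how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: restrict web-agent, leave other bots unrestricted.
ROBOTS_TXT = """\
User-agent: web-agent
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("web-agent/1.0", "https://example.com/private/page")
allowed = parser.can_fetch("web-agent/1.0", "https://example.com/blog/post")
delay = parser.crawl_delay("web-agent/1.0")
```

Here `blocked` is False, `allowed` is True, and `delay` is 10. Replacing the first group's rule with `Disallow: /` would exclude the crawler from the whole site.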
🔍 Detection Indicators
The primary User‑Agent string is web-agent/1.0, which may also appear as WebAgent/1.0 (+https://webagent.ai/bot) or web-agent/2.0 in server logs. A custom X-WebAgent-Crawl: 1 header and a From header containing [email protected] are sent. IP addresses resolve to hostnames ending in .webagent.ai, and the bot typically identifies itself with the full URI in the User‑Agent field.
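The indicators above could be combined into a simple log or request filter. The following Python sketch is one possible approach, not an official detection method: the function names are hypothetical, the regular expression only covers the User-Agent variants quoted on this page, and the hostname check assumes a reverse DNS lookup has already been performed elsewhere.

```python
import re
from typing import Mapping

# Matches "web-agent/1.0", "web-agent/2.0", "WebAgent/1.0 (+https://...)" etc.
UA_PATTERN = re.compile(r"^(?:web-agent|WebAgent)/\d+\.\d+\b")


def ua_matches(headers: Mapping[str, str]) -> bool:
    """True if the User-Agent header starts with a listed web-agent token."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return bool(UA_PATTERN.match(normalized.get("user-agent", "")))


def has_crawl_header(headers: Mapping[str, str]) -> bool:
    """True if the custom X-WebAgent-Crawl header is present with value 1."""
    normalized = {k.lower(): v.strip() for k, v in headers.items()}
    return normalized.get("x-webagent-crawl") == "1"


def hostname_matches(hostname: str) -> bool:
    """True if a reverse-DNS hostname ends in .webagent.ai (suffix check only)."""
    return hostname.rstrip(".").lower().endswith(".webagent.ai")
```

Headers are trivially spoofed, so the header checks identify only self-declared traffic. The hostname check is stronger when the returned name is also resolved forward and confirmed to map back to the connecting IP.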
📊 Data Usage
Collected data is primarily used to train WebAgent's large language models for summarization, data extraction, question answering, and knowledge graph construction. A secondary use feeds a real‑time trend analysis dashboard for marketing, research, and brand protection. WebAgent's privacy policy (webagent.ai/privacy) states that no personally identifiable information is intentionally collected and all data is stored encrypted.
⚙️ Rate Limiting Policy
Rate limiting is applied to web-agent because its crawl cadence, often thousands of requests per day across a domain, can degrade server performance if unchecked. The recommended threshold for blocking is 5 requests per second per source IP, a value published in WebAgent's own recommendations; note that the crawler's documented maximum rate is expressed per domain rather than per source IP. Many CDN providers can enforce a limit of this kind automatically.
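A threshold of 5 requests per second per source IP can be enforced in several ways. The sketch below shows one of them, a sliding-window counter in Python. The class name `SlidingWindowLimiter` and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to reason about; production setups more commonly rely on the rate limiting built into a web server, CDN, or WAF.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per source IP within window_seconds."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-IP queue of timestamps for requests that were allowed.
        self._hits = defaultdict(deque)

    def allow(self, source_ip: str, now: float) -> bool:
        """Record a request at time `now`; return False if over the limit."""
        hits = self._hits[source_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# Six requests 100 ms apart from the same (example) address.
decisions = [limiter.allow("198.41.0.1", i * 0.1) for i in range(6)]
```

In this run the first five requests are allowed and the sixth is rejected. A caller would typically pass `time.monotonic()` as `now` and answer rejected requests with HTTP 429, which the crawler is described as handling with exponential backoff.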
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.