webxm Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: webxm

🤖 Overview

WebXM is a web crawler operated by WebX Media Inc. and first documented in early 2022. It is primarily designed to collect publicly accessible web content for training generative AI models and improving search relevance algorithms within the WebX AI platform. Official documentation (webxm.com/bot) states that the bot is used for research and benchmarking, and its data feeds the proprietary WebX LLM and knowledge graph system.

🌐 Technical Behavior

WebXM issues up to 30 requests per second per source IP, drawing on a rotating pool of IPv4 and IPv6 addresses hosted on AWS, Google Cloud, and Azure. It prefers HTTP/2 and TLS 1.3 connections and sends a Referer header set to the target URL. The bot does not execute JavaScript; it fetches only static HTML and associated files (robots.txt, sitemap.xml, images, CSS). According to its GitHub repository (github.com/webxm/bot), it supports conditional GET requests via If-Modified-Since and If-None-Match (ETag validation) to reduce server load, but in practice many sites report that it ignores cache headers.
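To check whether a crawler actually honors conditional GET, it helps to see what the server side of that exchange looks like. The following is a minimal sketch, not WebXM's or any framework's implementation: the function name conditional_status and its parameters are hypothetical, and it assumes the resource's ETag and an aware UTC last-modified timestamp are already known. It answers 304 when the request's If-None-Match or If-Modified-Since validator still matches, and 200 otherwise.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping


def _strip_weak(tag: str) -> str:
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(
    request_headers: Mapping[str, str],
    etag: str,
    last_modified: datetime,  # assumed to be timezone-aware (UTC)
) -> int:
    """Return 304 if the client's cached copy is still valid, else 200."""
    headers = {k.lower(): v for k, v in request_headers.items()}

    # If-None-Match takes precedence over If-Modified-Since when both are sent.
    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        candidates = {_strip_weak(t.strip()) for t in if_none_match.split(",")}
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = headers.get("if-modified-since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the header
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return 304

    return 200
```

A crawler that ignores cache headers will show up in access logs as a steady stream of 200 responses for unchanged URLs, with no If-None-Match or If-Modified-Since headers on repeat requests.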

📋 robots.txt Compliance

WebXM adheres to the Robots Exclusion Standard: it reads robots.txt before each crawl session and respects Disallow directives. However, the bot does not honor Crawl-delay directives; instead it applies its own rate-limiting algorithm. The official site (webxm.com/robots) confirms compliance but warns operators that the bot may re-fetch robots.txt only once per domain per day, so rule changes may not take effect immediately.
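Before relying on a Disallow rule, it can be worth confirming that the rule parses the way you expect. The sketch below uses Python's standard urllib.robotparser to evaluate a sample robots.txt against the webxm token. The robots.txt content, the is_allowed helper, and the example.com URLs are illustrative assumptions; the parser only shows how a standards-following client would read the file, not how WebXM itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the webxm token site-wide,
# leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: webxm
Disallow: /

User-agent: *
Disallow:
"""

# User-Agent string as documented for the bot.
WEBXM_UA = "webxm/1.0 (compatible; WebX Bot; +https://webxm.com/bot)"


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(WEBXM_UA, "https://example.com/articles/1"))
    print(is_allowed("SomeOtherBot/2.0", "https://example.com/articles/1"))
```

With the sample file above, the first call reports that webxm is blocked and the second that an unrelated crawler is allowed. Because the bot reportedly ignores Crawl-delay, a crawl-rate line in robots.txt should not be expected to slow it down.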

🔍 Detection Indicators

The primary User-Agent string is webxm/1.0 (compatible; WebX Bot; +https://webxm.com/bot). The bot also sends a custom header, X-WebX-Bot: true. Behavioral fingerprints include exactly 5 simultaneous connections and the same User-Agent string on every request, including sub-resource fetches. The bot does not send a From header.
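The header-based indicators above can be turned into a simple request check. This is a minimal sketch under stated assumptions: the function name looks_like_webxm and the regular expression are illustrative, not part of any published WebXM specification, and both headers can be spoofed or omitted, so a match is a hint rather than proof of origin.

```python
import re
from typing import Mapping

# Matches the product token at the start of the documented User-Agent
# ("webxm/1.0 ..."); the exact pattern is an assumption.
WEBXM_UA_PATTERN = re.compile(r"\bwebxm/\d", re.IGNORECASE)


def looks_like_webxm(headers: Mapping[str, str]) -> bool:
    """Return True if request headers carry a documented WebXM indicator."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    custom_flag = normalized.get("x-webx-bot", "").strip().lower()
    return bool(WEBXM_UA_PATTERN.search(user_agent)) or custom_flag == "true"
```

A request carrying either the documented User-Agent token or the X-WebX-Bot: true header is flagged; an ordinary browser User-Agent is not. Pairing this check with the behavioral fingerprints (connection count, missing From header) would reduce false positives from clients that merely copy the string.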

📊 Data Usage

Collected data is used exclusively for training WebX Media’s proprietary language models and building the WebX Knowledge Graph, which powers contextual search and question-answering features. According to their privacy policy (webxm.com/privacy), no personal or copyrighted content is retained in training sets beyond fair-use excerpts, and all raw data is deleted after 30 days. The company publishes aggregated statistics about crawled domains on its transparency dashboard.

⚙️ Rate Limiting Policy

Rate limiting WebXM is recommended because its aggressive default crawl speed (up to 30 requests per second per IP, or 1,800 per minute) can overwhelm under-provisioned servers. The recommended threshold is 200 requests per minute per IP, with a 429 response triggering a backoff; this keeps the bot productive without causing denial-of-service conditions for other traffic.
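The 200 requests per minute per IP threshold can be enforced with a sliding-window counter. The sketch below is one possible approach, not a prescribed implementation: the class name SlidingWindowLimiter, its parameters, and the in-memory storage are assumptions, and a production deployment would more likely enforce the limit at the reverse proxy or in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Per-IP limiter: at most `limit` requests in any `window_s` seconds."""

    def __init__(
        self,
        limit: int = 200,          # threshold recommended above
        window_s: float = 60.0,    # one-minute window
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def status_for(self, ip: str) -> int:
        """Record a request from `ip`; return 200 if allowed, 429 if over limit."""
        now = self.clock()
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return 429  # rejected requests are not counted against the window
        hits.append(now)
        return 200
```

Calling limiter.status_for(client_ip) once per incoming request yields the status code to return. Because WebXM reportedly rotates source addresses, per-IP limits may need to be combined with User-Agent-based limits to be effective.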

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
