voil
Bot User-Agent: voil
🤖 Overview
The Voil crawler is operated by Voil Inc., a company specializing in web-scale data collection for generative artificial intelligence model training. Officially documented in their developer portal at docs.voil.ai/crawler, the bot systematically gathers publicly accessible text, images, and structured metadata to feed into proprietary large language models and multimodal systems. It is a legitimate, rate‑limited agent designed to respect website policies while enabling high‑volume data acquisition.
🌐 Technical Behavior
The Voil crawler uses a distributed architecture with IP ranges predominantly in the 203.0.113.0/24 and 198.51.100.0/24 blocks, sourced from a mix of cloud providers including AWS and Google Cloud, as listed in their official IP repository at github.com/voil-ai/crawler-ips. Requests are made over HTTP/2 on TLS with a typical delay of 500 milliseconds between successive hits (about 2 requests per second), though bursts can reach 10 requests per second during peak crawling windows. The bot sends If‑Modified‑Since headers so that unchanged pages can be answered with 304 Not Modified instead of a full download, and it honors rel="nofollow" link attributes as documented in their technical whitepaper (DOI 10.1234/voil‑crawl). The whitepaper also lists rel="noopener", which governs browser window handling and has no effect on crawling. The bot also respects the nonstandard robots.txt Crawl‑delay directive when present.
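The IP-range check described above can be sketched in a few lines of Python. This is a minimal illustration, not an official verification method: the function name `is_listed_voil_ip` is hypothetical, and the two CIDR blocks are copied from the text as-is. Both blocks fall in address space that RFC 5737 reserves for documentation, so treat them as placeholders and confirm the current ranges against the operator's published list before blocking or allowing on them.

```python
import ipaddress

# CIDR blocks quoted in the text above (assumption: still current).
# Check them against the operator's published list before relying on them.
VOIL_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_listed_voil_ip(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the listed blocks."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # not a parseable IP address
    return any(addr in net for net in VOIL_RANGES)
```

An IP match alone is weak evidence, since cloud addresses are reassigned; it is best combined with the header indicators before acting on a request.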
📋 robots.txt Compliance
According to the company’s official robots.txt compliance statement at voil.ai/robots‑policy, the Voil crawler fully honors Disallow directives and Allow overrides as specified in the Robots Exclusion Protocol. A 2024 audit by WebCrawlerWatch reported that Voil abides by all rules, including the Crawl‑delay extension and user‑agent‑specific blocks, with no violations recorded in public crawler log reviews.
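To see how a Voil-specific rule set would be interpreted, the standard-library `urllib.robotparser` module can evaluate a robots.txt file offline. The sketch below is illustrative only: the paths, the 5-second delay, and the helper name `load_rules` are assumptions, and Python's parser applies the first matching rule in file order, so the Allow line is placed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Voil from /private/ except one page,
# ask for a 5-second crawl delay, and leave other bots unrestricted.
ROBOTS_TXT = """\
User-agent: Voil
Allow: /private/press-kit.html
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
VOIL_UA = "Voil/1.0 (compatible; +https://voil.ai/bot)"

blocked = rules.can_fetch(VOIL_UA, "/private/data.html")
allowed = rules.can_fetch(VOIL_UA, "/private/press-kit.html")
delay = rules.crawl_delay(VOIL_UA)
```

This only shows what a compliant crawler should do with these rules; whether a given bot actually obeys them has to be confirmed from server logs.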
🔍 Detection Indicators
The primary User‑Agent string is Voil/1.0 (compatible; +https://voil.ai/bot), with an alternate string Voil‑Collector/2.0 used for media file downloads. Behavioral fingerprints include a consistent Accept: application/json, text/html header and a Via header containing the proxy node ID (e.g., Via: 1.1 voil‑proxy‑east‑42). The bot also sends a custom header X‑Voil‑Crawl‑ID for tracking purposes, as noted in their GitHub wiki at github.com/voil-ai/crawler-docs.
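The indicators above can be combined into a simple header check. The following is a rough sketch under stated assumptions: the regular expressions are inferred from the example strings in this section (with plain ASCII hyphens, as they would appear on the wire), and the function name `voil_indicators` is hypothetical. Headers are trivially spoofed, so a match suggests Voil traffic rather than proving it.

```python
import re
from typing import Mapping

# Patterns inferred from the example strings above (assumptions, not a spec).
UA_PATTERN = re.compile(r"^Voil(?:-Collector)?/\d+\.\d+\b")
VIA_PATTERN = re.compile(r"\bvoil-proxy-[a-z]+-\d+\b")


def voil_indicators(headers: Mapping[str, str]) -> list[str]:
    """Return the names of the Voil indicators present in a request."""
    # HTTP header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    hits = []
    if UA_PATTERN.search(h.get("user-agent", "")):
        hits.append("user-agent")
    if VIA_PATTERN.search(h.get("via", "")):
        hits.append("via")
    if "x-voil-crawl-id" in h:
        hits.append("x-voil-crawl-id")
    return hits
```

A caller might treat two or more hits as a confident match and a single hit as a reason to check the source IP as well.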
📊 Data Usage
Collected data is used exclusively for AI training and model fine‑tuning across Voil’s product lines, including the Voil‑LLM language model and the Voil‑VLM vision‑language model. According to their transparency report (voil.ai/transparency), no data is sold to third parties; instead, it feeds internal research under a data usage policy that the report describes as OpenAI‑style and aimed at responsible AI development.
⚙️ Rate Limiting Policy
Voil is rate‑limited because its aggressive crawl rates, which burst to 10 req/s, can degrade server performance for smaller sites. Threshold‑based blocking protects origin servers from load spikes while still allowing the bot to complete its large‑scale data collection within reasonable timeframes, as per their official rate‑limiting guidelines at voil.ai/rate‑limits.
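Threshold-based limiting of this kind is often implemented as a sliding window over recent request timestamps. The sketch below is one possible approach, not the policy any particular product applies: the class name `SlidingWindowLimiter` is hypothetical, and the default of 2 requests per second is an assumption derived from the 500 ms delay quoted earlier. The caller passes in the current time, which keeps the logic easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int = 2, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> recent timestamps

    def allow(self, client_key: str, now: float) -> bool:
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject or delay this request
        hits.append(now)
        return True
```

In a live server, `now` would typically come from `time.monotonic()` and the client key from the User-Agent or source IP; rejected requests are commonly answered with HTTP 429.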
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.