WebImageCollector Bot — Detection, Blocking & Technical Analysis

WebImageCollector

Bot User-Agent: webimagecollector

🤖 Overview

WebImageCollector is a web crawler operated by the WebImageCollector Project (webimagecollector.org) and first documented in 2019. It systematically gathers publicly available images from the internet to train large‑scale computer vision models and image recognition algorithms. The project is maintained by a consortium of academic institutions including MIT and Stanford, and the collected data feeds into open‑source datasets such as ImageNet‑21K and LAION‑5B.

🌐 Technical Behavior

The bot crawls from a rotating set of IP addresses, mainly within Amazon Web Services (AWS) EC2 infrastructure (ranges 3.0.0.0/8, 52.0.0.0/8, 54.0.0.0/8), plus a small number of Google Cloud IPs (35.0.0.0/8). Each source IP sends at most 2 requests per second (120 per minute), so the reported average of 500 requests per minute during peak collection phases reflects traffic aggregated across several IPs. The bot requests only image formats (JPEG, PNG, GIF, WebP, SVG) and ignores non‑image resources. It follows HTTP redirects (up to 5 hops) and respects Cache‑Control: no‑transform headers. It uses HTTP/1.1 with persistent connections and sends a From header containing [email protected] for contact.
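As a rough illustration, the address ranges listed above could be checked with Python's standard ipaddress module. This is a minimal sketch, not the project's or Boteraser's tooling: the function name in_listed_ranges is hypothetical, and the CIDR blocks are copied exactly as listed. Each /8 covers about 16.7 million addresses of general cloud address space, so a match is only a weak hint and is best combined with the header indicators rather than used alone for blocking.

```python
import ipaddress

# CIDR blocks copied from the text above. They are far broader than the
# crawler itself, so a match only means "this address is in a listed block".
LISTED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("3.0.0.0/8", "52.0.0.0/8", "54.0.0.0/8", "35.0.0.0/8")
]


def in_listed_ranges(addr: str) -> bool:
    """Return True if addr parses as an IP inside one of the listed blocks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a valid IP literal (for example a hostname or garbage input).
        return False
    # IPv6 addresses never match these IPv4 networks, so they yield False.
    return any(ip in net for net in LISTED_RANGES)
```

For example, in_listed_ranges("52.1.2.3") is true, while a private address such as "10.0.0.1" is not in any listed block.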

📋 robots.txt Compliance

According to the project’s official documentation at webimagecollector.org/robots, the bot fully honors Disallow directives in robots.txt and also respects noindex meta tags on individual pages. It fetches each domain’s robots.txt once per day and caches the result for 24 hours, so a rule change can take up to a day to take effect. No published security advisory (such as a CVE entry) reports that it has intentionally ignored robots.txt rules.
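If the bot honors Disallow as described, a robots.txt group addressed to its user-agent token should be enough to opt out. The sketch below checks such a file locally with Python's standard urllib.robotparser. The sample robots.txt content, the example.com URL, and the helper name is_allowed are illustrative assumptions; the token webimagecollector is the one listed at the top of this page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block this crawler everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: webimagecollector
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt text locally, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    image_url = "https://example.com/images/photo.jpg"
    print(is_allowed(ROBOTS_TXT, "webimagecollector", image_url))
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", image_url))
```

This only confirms that the rules parse the way you intend under Python's parser. Whether the crawler actually obeys them has to be verified from server logs.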

🔍 Detection Indicators

The primary User‑Agent string is WebImageCollector/1.0 (compatible; +https://webimagecollector.org/bot). A secondary string, Mozilla/5.0 (compatible; WebImageCollector/1.1), is used when the bot sends an Accept header of image/webp,image/jpeg,*/*. Behavioral fingerprints include GET‑only requests for image files, no JavaScript execution, and a consistent Accept‑Language: en‑US,en;q=0.9 header. The bot also sends a custom X‑Crawler‑Purpose: image‑collection header on all requests.
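The header indicators above can be combined into a simple check. The following is a minimal sketch that assumes request headers are available as a name-to-value mapping; the function name looks_like_webimagecollector is hypothetical. Because headers are trivially spoofed or omitted, a match identifies only traffic that declares itself as this bot.

```python
from typing import Mapping


def looks_like_webimagecollector(headers: Mapping[str, str]) -> bool:
    """Return True if the headers carry one of the listed indicators."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Both listed User-Agent strings contain the product token.
    user_agent = lowered.get("user-agent", "").lower()
    if "webimagecollector" in user_agent:
        return True

    # Custom header reported to be sent on all requests.
    purpose = lowered.get("x-crawler-purpose", "").strip().lower()
    return purpose == "image-collection"


if __name__ == "__main__":
    sample = {
        "User-Agent": "Mozilla/5.0 (compatible; WebImageCollector/1.1)",
        "Accept": "image/webp,image/jpeg,*/*",
    }
    print(looks_like_webimagecollector(sample))
```

The Accept‑Language value is deliberately not used as a signal here, since en‑US,en;q=0.9 is also sent by many ordinary browsers.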

📊 Data Usage

Collected images are stored in the WebImageDataset repository (available at github.com/webimagecollector/dataset) and used exclusively for non‑commercial research purposes. The data trains models such as Vision Transformer (ViT) and CLIP, and is also used to improve image classification benchmarks. Images are not sold or redistributed in raw form; metadata (URL, dimensions, dominant colors) is publicly released under CC‑BY‑4.0.

⚙️ Rate Limiting Policy

Rate‑limiting is recommended because the bot’s sustained 2 req/s per IP can overload small websites with limited bandwidth. The advised threshold is 10 requests per 5 seconds per IP, which equals the bot’s stated maximum of 2 req/s. Requests beyond that threshold should receive a 503 response, which pushes the bot back onto its own internal backoff algorithm. This protects origin servers while allowing legitimate dataset collection to continue.
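One way to apply the advised threshold is a per-IP sliding-window counter. The sketch below is an in-memory illustration only: the class name SlidingWindowLimiter and its parameters are hypothetical, it treats the threshold as "at most 10 requests in any 5-second window", and a production setup would more likely use the rate-limiting features of the web server or CDN.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int = 10,
        window_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def status_for(self, client_ip: str) -> int:
        """Return 200 if the request is within budget, otherwise 503."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return 503  # over the threshold: ask the client to back off
        hits.append(now)
        return 200


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    statuses = [limiter.status_for("52.1.2.3") for _ in range(11)]
    print(statuses)
```

Rejected requests are not recorded in the window, so a client that keeps retrying is readmitted as soon as its earlier accepted requests age out.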

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.