Thumbor Bot — Detection, Blocking & Technical Analysis

Thumbor

Bot User-Agent: thumbor

🤖 Overview

Thumbor is an open-source smart imaging service originally developed by globo.com and now maintained by the community under the Thumbor project (GitHub repository: https://github.com/thumbor/thumbor). It runs as an HTTP-based image processing proxy that operators deploy on their own servers to fetch, resize, crop, and optimize images from remote URLs on demand. It is not a search engine crawler: it retrieves an image from an external source only when a client application, such as a content management system or social media platform, requests a transformed version of that image.

🌐 Technical Behavior

Thumbor's fetch behavior is driven by configuration rather than a fixed set of crawl rules. Its HTTP loader sends a configurable User-Agent (by default a string of the form Thumbor/<version>, set through HTTP_LOADER_DEFAULT_USER_AGENT; it can also be configured to forward the requesting client's User-Agent), and it has no dedicated IP range because each deploying organization hosts it on its own infrastructure. When an image URL is requested through Thumbor (e.g., /unsafe/300x200/example.com/image.jpg), the service makes an HTTP GET request to the original image source, following redirects up to a configurable limit and giving up after configurable connect and request timeouts (typically 10–30 seconds in total). The number of simultaneous outbound connections is capped by configuration (commonly 10–50); the cap applies to the Thumbor process as a whole rather than to each origin host. Requests carry an X-Forwarded-For header with the original client IP only when the operator has enabled header forwarding or a proxy in the path adds it. Official documentation at https://thumbor.readthedocs.io/ describes how to restrict which source URLs can be fetched (e.g., with a whitelist of allowed source domains).
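The settings described above live in a thumbor.conf file, which uses Python syntax. The fragment below is a minimal sketch of how an operator might set them. The option names are given as they appear in Thumbor's configuration reference to the best of our knowledge and should be checked against the installed version; the values and domains are illustrative assumptions, not recommendations.

```python
# thumbor.conf (Python syntax). Values are illustrative only.

# Outbound identity: what origin servers see in the User-Agent header.
# The string below is a hypothetical custom agent.
HTTP_LOADER_DEFAULT_USER_AGENT = "ExampleImageProxy/1.0 (+https://example.com/bot)"
HTTP_LOADER_FORWARD_USER_AGENT = False  # do not pass the client's UA through

# Timeouts, in seconds.
HTTP_LOADER_CONNECT_TIMEOUT = 5
HTTP_LOADER_REQUEST_TIMEOUT = 20

# Redirect handling.
HTTP_LOADER_FOLLOW_REDIRECTS = True
HTTP_LOADER_MAX_REDIRECTS = 5

# Cap on simultaneous outbound HTTP connections for this Thumbor process.
HTTP_LOADER_MAX_CLIENTS = 10

# Source whitelist: only these hosts may be fetched (hypothetical domains).
ALLOWED_SOURCES = ["images.example.com", "static.example.org"]
```

An operator who sets a descriptive custom User-Agent with a contact URL, as sketched here, makes the instance much easier for origin sites to identify and allow.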

📋 robots.txt Compliance

Thumbor itself has no built-in mechanism for checking robots.txt, because it is image-processing middleware rather than a traditional web crawler. Operators can add such a check by integrating a robots.txt parser (e.g., Python's standard urllib.robotparser module) into a custom loader, an approach that has come up in community discussions on the project's GitHub issues. In practice, most production deployments either whitelist allowed source domains or rely on rate limiting rather than robots.txt compliance.
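A robots.txt gate of the kind described above could look roughly like the following. This is a minimal sketch using only Python's standard library; the function name is_fetch_allowed, the example rules, and the idea of calling it before a loader fetches a URL are assumptions for illustration, not part of Thumbor. The robots.txt body is passed in as a string so the sketch makes no network requests.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, url: str, user_agent: str = "thumbor") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines; nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules an origin site might publish.
EXAMPLE_ROBOTS = """\
User-agent: thumbor
Disallow: /private/
"""

print(is_fetch_allowed(EXAMPLE_ROBOTS, "https://example.com/img/a.jpg"))
print(is_fetch_allowed(EXAMPLE_ROBOTS, "https://example.com/private/a.jpg"))
```

With these rules the first URL is allowed and the second is refused. A real integration would also need to download and cache each origin's robots.txt, which is left out here.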

🔍 Detection Indicators

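By default, Thumbor's HTTP loader identifies itself with a User-Agent of the form Thumbor/<version>, which is why the token thumbor is the usual match string. The value can be overridden in configuration (e.g., HTTP_LOADER_DEFAULT_USER_AGENT = "MyCustomAgent/1.0"), and deployments that forward the client's User-Agent will not show it at all, so the header is a weak signal on its own. Behavioral fingerprints include the lack of a Referer header in many setups and, where header forwarding is enabled, an X-Forwarded-For header mirroring the original client IP. Security advisories cited for the project (e.g., CVE-2020-5271, described as a path traversal vulnerability patched in v6.7.1) concern Thumbor's own URL handling. The /unsafe/ and filters: elements involved are path segments of requests sent to a Thumbor server, not query parameters, and they do not appear in the requests Thumbor sends to an origin.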
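The User-Agent indicator can be checked with a simple case-insensitive match. The helper below is a sketch, not a vetted detection rule: the name looks_like_thumbor and the header mapping it takes are assumptions for illustration, and because the header is operator-configurable, a match should be treated as a hint rather than proof.

```python
import re
from typing import Mapping

# Matches the token "thumbor" in any casing, e.g. "Thumbor/7.7.0".
THUMBOR_UA = re.compile(r"\bthumbor\b", re.IGNORECASE)


def looks_like_thumbor(headers: Mapping[str, str]) -> bool:
    """Return True if the request's User-Agent header contains the thumbor token."""
    # Header names are case-insensitive, so compare them in lowercase.
    user_agent = next(
        (value for name, value in headers.items() if name.lower() == "user-agent"),
        "",
    )
    return bool(THUMBOR_UA.search(user_agent))


print(looks_like_thumbor({"User-Agent": "Thumbor/7.7.0"}))
print(looks_like_thumbor({"user-agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
```

Combining this check with the behavioral signals above, such as a missing Referer on image requests, reduces false negatives from instances that use a custom agent string.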

📊 Data Usage

The data Thumbor collects (the requested image bytes and optional metadata such as EXIF) is used solely for real-time image transformation and caching; the service does not store or index images for machine learning or analytics purposes. The processed image is returned to the requesting client and may be cached in a storage backend (e.g., filesystem, Redis) for performance. Thumbor can also log requested URLs, which operators may use for monitoring, but the project includes no component that collects data for third-party AI training.

⚙️ Rate Limiting Policy

Administrators typically rate-limit Thumbor per host, because aggressive fetching from a single source could overload the origin server, especially in high-traffic deployments. A threshold-based blocking policy (e.g., 100 requests per minute) is a common starting point that prevents abuse while still allowing legitimate image processing for end users; the exact threshold should be tuned to the origin's capacity.
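A threshold policy like the one above can be sketched as a sliding-window counter keyed by client. The class below is an illustrative assumption rather than anything shipped with Thumbor or prescribed by its documentation; the name SlidingWindowLimiter and its parameters are hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of allowed requests

    def allow(self, client_key: str, now: float) -> bool:
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: block or throttle this request
        hits.append(now)
        return True


# Example with a small threshold: 3 requests per 60 seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60.0)
decisions = [limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 60.0)]
print(decisions)
```

In this example the fourth request is refused and the fifth is allowed again once the first timestamp has left the window. In production the time would come from a monotonic clock and the state would usually live in a shared store rather than in process memory.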

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
