Pixray

Bot User-Agent: pixray

🤖 Overview

Pixray is a web crawler developed and operated by the open‑source Pixray project (maintained on GitHub at github.com/replicate/pixray), initially released in 2022 as part of a text‑to‑image generation pipeline. Its primary purpose is to collect publicly accessible images and alt‑text or captions from websites, which are then used to train and fine‑tune Pixray’s neural network models. The bot feeds data into the Pixray API and the autonomous image‑generation system, enabling it to learn visual‑language associations from real‑world examples.

🌐 Technical Behavior

The crawler operates with a polite but persistent crawl pattern, typically issuing one request every 3–5 seconds per domain and keeping to a global cap of 10 requests per second across its distributed infrastructure. It uses a pool of IP addresses drawn from major cloud providers, including Amazon Web Services (EC2 ranges) and Google Cloud Platform (us‑central1 region), documented in the project’s configuration files. The bot primarily requests images (JPEG, PNG, WebP) and the adjacent HTML pages, from which it extracts the alt attribute of img tags. Requests use HTTP/1.1 over TLS with the Accept‑Language header set to en‑US. It does not follow links marked rel="nofollow", but does traverse links marked rel="nofollow ugc" where explicitly allowed. If a server returns a 429 or 503 status code, the bot slows down using exponential back‑off.
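The back‑off behavior described above can be illustrated with a short sketch. This is not Pixray's actual code: the 4‑second base delay (the midpoint of the 3–5 second range), the doubling factor, and the 300‑second cap are all assumptions chosen for illustration, and next_delay is a hypothetical helper name.

```python
from typing import Tuple

RETRY_STATUSES = {429, 503}  # status codes the text says trigger back-off


def next_delay(status: int, failures: int,
               base: float = 4.0, cap: float = 300.0) -> Tuple[float, int]:
    """Return (seconds to wait, updated failure count) after a response.

    base and cap are assumed values, not documented Pixray settings.
    """
    if status in RETRY_STATUSES:
        failures += 1
        # Double the wait for each consecutive throttled response, up to cap.
        return min(cap, base * (2 ** failures)), failures
    # Any other status resets the counter and returns to the normal interval.
    return base, 0
```

Under these assumptions, a crawler that receives three consecutive 429 responses would wait 8, 16, and then 32 seconds before returning to the 4‑second interval on the first successful response.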

📋 robots.txt Compliance

According to the official Pixray GitHub Wiki (see the Crawler.md file), the bot fully honors Disallow directives in robots.txt. It reads the file at the start of each crawl session and caches the rules for up to 24 hours. If a site sets a Crawl‑delay directive, Pixray respects that delay, adding a random offset of 1–3 seconds to avoid synchronized bursts. Compliance is verified through automated tests that monitor the bot’s request logs.
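Site operators who want to check how a compliant crawler would read their rules can do so with Python's standard urllib.robotparser module. The sketch below is only an illustration: the robots.txt content, the example.com URLs, and the /private/ path are made up, and the jitter helper mirrors the 1–3 second offset described above rather than any published Pixray code.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that restricts the "pixray" user-agent token.
ROBOTS_TXT = """\
User-agent: pixray
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() matches on the product token, so pass "pixray" rather than
# the full Mozilla-style User-Agent string.
allowed_public = parser.can_fetch("pixray", "https://example.com/images/cat.jpg")
allowed_private = parser.can_fetch("pixray", "https://example.com/private/cat.jpg")
crawl_delay = parser.crawl_delay("pixray")  # None if no Crawl-delay applies


def delay_with_offset(delay: float) -> float:
    """Add the 1-3 second random offset described in the text."""
    return delay + random.uniform(1.0, 3.0)
```

With these rules, the image under /images/ is permitted, the one under /private/ is not, and a crawler applying the offset would wait between 11 and 13 seconds between requests.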

🔍 Detection Indicators

The default User‑Agent string is Mozilla/5.0 (compatible; Pixray/2.1; +https://pixray.ai/bot), though alternative strings with other version numbers (e.g., Pixray/2.0) have been observed. The bot includes a From header containing a contact email address and an X‑Purpose header with the value image‑collection. Behavioral fingerprints include a high ratio of image requests to page requests and a consistent interval between requests that matches the configured crawl delay. Logs for Pixray often show a pattern of only requesting URLs ending in common image extensions.
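The indicators above can be turned into a simple log check. The following is a minimal sketch, assuming access logs have already been parsed into User‑Agent strings and request URLs; pixray_version and image_request_ratio are hypothetical helper names. Because a User‑Agent string is easy to spoof, a match is a hint rather than proof of identity.

```python
import re
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Matches "Pixray/2.1", "Pixray/2.0", and similar version tokens.
PIXRAY_UA = re.compile(r"\bPixray/(\d+(?:\.\d+)*)", re.IGNORECASE)
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def pixray_version(user_agent: str) -> Optional[str]:
    """Return the advertised Pixray version, or None if the token is absent."""
    match = PIXRAY_UA.search(user_agent)
    return match.group(1) if match else None


def image_request_ratio(urls: Iterable[str]) -> float:
    """Return the fraction of requested URLs whose path ends in an image extension."""
    urls = list(urls)
    if not urls:
        return 0.0
    images = sum(
        1 for url in urls
        if urlsplit(url).path.lower().endswith(IMAGE_EXTENSIONS)
    )
    return images / len(urls)
```

A client whose User‑Agent matches and whose image ratio is close to 1.0 fits the profile described in this section; what threshold to act on is left to the operator.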

📊 Data Usage

Collected images and their associated captions are ingested into Pixray’s training pipeline, where they are used to improve the model’s ability to generate images from text prompts. The data is not sold or shared with third parties; it remains within the Pixray project’s compute environment, as stated in the project’s privacy policy linked from the GitHub repository. After training, raw images are discarded within 30 days to comply with data minimization best practices.

⚙️ Rate Limiting Policy

Pixray is rate‑limited because its persistent, distributed crawl pattern can inadvertently consume significant server resources if left unmanaged. A threshold‑based limit (e.g., 10 requests per minute per IP) is recommended to protect server stability while still allowing the bot to perform its legitimate data‑collection function for AI training.
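One way to apply a threshold such as 10 requests per minute per IP is a sliding‑window counter. The sketch below is a generic in‑memory illustration, not Boteraser's or any particular server's implementation; the class name SlidingWindowLimiter and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each IP."""

    def __init__(self, limit: int = 10, window: float = 60.0) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Return True if the request is within the limit, recording it if so."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would typically respond with 429
        hits.append(now)
        return True
```

A rejected request would normally receive a 429 response, which the back‑off behavior described under Technical Behavior is said to honor. A production deployment would also need to evict idle IPs and share state across server processes, which this sketch leaves out.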

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.