datafountains
Bot User-Agent: datafountains
🤖 Overview
DataFountains is a legitimate web crawler operated by the web intelligence company DataFountains Inc., headquartered in San Francisco, California. According to the official DataFountains documentation (datafountains.com/crawler), the bot systematically gathers publicly accessible web content to supply a large-scale structured dataset used for training commercial natural language processing (NLP) models and powering the company’s ContentDNA™ analytics platform. The crawler was first publicly documented in a 2021 blog post detailing its ethical scraping principles.
🌐 Technical Behavior
The DataFountains crawler uses a multi-threaded, JavaScript-rendering engine based on Puppeteer and Headless Chrome 112, which lets it interact with single-page applications (SPAs) and dynamically loaded content. It crawls at a moderate rate, capped at 10 requests per second per domain according to its official rate-limit policy. IP addresses are drawn from two public AWS EC2 ranges (54.193.0.0/16 and 35.167.0.0/16) and a dedicated /20 netblock owned by DataFountains Inc. The crawler sends requests over both HTTP/1.1 and HTTP/2 and includes a From: [email protected] header for contact. It always sets the Accept-Language header to en-US,en;q=0.9 and rarely follows redirect chains longer than three hops.
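As a rough illustration of how the published ranges could be used, the sketch below checks whether a client address falls inside the two AWS EC2 ranges quoted above. The function name `in_listed_ranges` is hypothetical, and the dedicated /20 netblock is left out because the source does not give it. Both ranges belong to shared cloud address space, so a match is only a weak signal and should be combined with other indicators.

```python
import ipaddress

# Ranges quoted in the text above. The dedicated /20 netblock is not
# specified in the source, so it is deliberately not included here.
LISTED_RANGES = [
    ipaddress.ip_network("54.193.0.0/16"),
    ipaddress.ip_network("35.167.0.0/16"),
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if `ip` parses and falls inside one of the listed ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IPv4/IPv6 literal.
        return False
    return any(addr in net for net in LISTED_RANGES)
```

For example, `in_listed_ranges("54.193.12.7")` is true, while `in_listed_ranges("8.8.8.8")` is false.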
📋 robots.txt Compliance
DataFountains fully honors robots.txt directives, according to its 2022 transparency report (datafountains.com/robots-compliance). The crawler checks the file at the start of each crawl session and caches it for up to 24 hours. It respects Crawl-Delay directives and automatically backs off when a server returns 429 Too Many Requests. Internal testing reportedly shows it never bypasses disallowed paths, even on domains with complex rule sets.
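To see how a robots.txt rule set aimed at this crawler would be interpreted, the sketch below feeds a sample file to Python's standard `urllib.robotparser`. The sample rules, the `example.com` URLs, and the product token `DataFountains` are assumptions made for illustration; the source does not state which token the crawler matches in robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow the crawler down and keep it out of /private/.
# The "DataFountains" token is assumed, not confirmed by the source.
ROBOTS_TXT = """\
User-agent: DataFountains
Crawl-delay: 5
Disallow: /private/

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
blocked = parser.can_fetch("DataFountains", "https://example.com/private/report")
allowed = parser.can_fetch("DataFountains", "https://example.com/blog/")
delay = parser.crawl_delay("DataFountains")
```

With these rules, `blocked` is `False`, `allowed` is `True`, and `delay` is `5`. A compliant crawler would be expected to reach the same conclusions, but server logs are the only way to confirm actual behavior.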
🔍 Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; DataFountains/2.0; +https://datafountains.com/crawler). A secondary UA is used for mobile rendering: DataFountains-Mobile/1.0 (Android 12; compatible; +https://datafountains.com/crawler). Behavioral fingerprints include a steady interval of 100–150 ms between page requests, which is consistent with the stated ceiling of 10 requests per second, and a distinct X-DataFountains-ClientID HTTP header containing a UUID v4. The crawler never sends cookies or persists sessions.
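The indicators above can be combined into a simple header check. The sketch below is one possible approach, not an official detection method: the regular expression and the function names `is_uuid4` and `looks_like_datafountains` are hypothetical, and the logic relies only on the User-Agent strings and the X-DataFountains-ClientID header described in this section.

```python
import re
import uuid

# Matches "DataFountains/2.0" and "DataFountains-Mobile/1.0" style tokens.
UA_PATTERN = re.compile(r"\bDataFountains(?:-Mobile)?/\d+\.\d+\b")
CLIENT_ID_HEADER = "x-datafountains-clientid"  # compared in lower case


def is_uuid4(value: str) -> bool:
    """Return True if `value` parses as a version 4 UUID."""
    try:
        return uuid.UUID(value).version == 4
    except (ValueError, AttributeError, TypeError):
        return False


def looks_like_datafountains(headers: dict[str, str]) -> bool:
    """Flag a request whose headers match either documented indicator."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    ua_match = bool(UA_PATTERN.search(lowered.get("user-agent", "")))
    id_match = is_uuid4(lowered.get(CLIENT_ID_HEADER, ""))
    return ua_match or id_match
```

Header values are trivial to spoof, so a match here identifies traffic that claims to be DataFountains rather than proving its origin.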
📊 Data Usage
Collected content is parsed, deduplicated, and stored in DataFountains’ private WebGraph database, which feeds its NLP model training pipeline and the ContentDNA analytics dashboard. According to the company’s privacy policy, data is used exclusively for internal AI research and is never resold or shared with third parties. The dataset excludes personally identifiable information (PII) through automated redaction of email addresses, phone numbers, and Social Security numbers.
⚙️ Rate Limiting Policy
Although DataFountains is a legitimate, non-malicious agent, its crawl volume (up to 10 req/s) can still overwhelm under-resourced servers. Web administrators are therefore advised to rate-limit it at the edge, which prevents performance degradation while still allowing it to collect data for AI training.
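One common way to implement this kind of limit is a token bucket. The sketch below is a minimal in-process version for illustration only: the class name `TokenBucket` and the limit of 2 requests per second are assumptions, and a production setup would normally apply the limit in a CDN, load balancer, or reverse proxy rather than in application code.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would typically answer 429 Too Many Requests


# Hypothetical limit: 2 requests per second for traffic identified as DataFountains.
datafountains_limit = TokenBucket(rate=2, capacity=2)
```

Because the crawler is documented as backing off on 429 Too Many Requests, returning that status when `allow()` is false should slow it down without blocking it outright.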
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.