focused_crawler

Crawler User-Agent: focused-crawler

🤖 Overview

focused_crawler is a web crawling agent operated by FocusedCrawler Inc., a data services company founded in 2021. Its primary purpose is to collect publicly available web content at scale for training proprietary AI models and building domain-specific datasets used in the FocusedData Suite product. The crawler was first publicly documented in a technical blog post on focusedcrawler.com in January 2022, and its User-Agent string has been observed in server logs since that time.

🌐 Technical Behavior

focused_crawler performs a breadth‑first crawl with a default request rate of approximately 8 requests per second per domain, though this can spike to 15 during initial discovery phases. It communicates exclusively over HTTP/1.1 and HTTP/2, and always includes an Accept-Language header set to en-US,en;q=0.9. The bot uses IPv4 addresses from the range 198.51.100.64/26 and IPv6 addresses from 2001:db8:1::/48, as listed in the official ASN records for FocusedCrawler Inc. (AS398765). Crawl sessions are initiated with a warm‑up period of 30 seconds during which only the robots.txt file and a single test page are requested. It follows redirects up to 5 hops and caches DNS responses for 600 seconds.
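The published address ranges lend themselves to a simple source-IP check. The following is a minimal sketch using only Python's standard `ipaddress` module; the function name `is_published_crawler_ip` is hypothetical, and the two networks are copied from the paragraph above rather than verified independently.

```python
import ipaddress

# Ranges as stated in this entry; confirm against the operator's current records.
CRAWLER_NETWORKS = [
    ipaddress.ip_network("198.51.100.64/26"),
    ipaddress.ip_network("2001:db8:1::/48"),
]


def is_published_crawler_ip(addr: str) -> bool:
    """Return True if addr falls inside one of the listed crawler ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a parseable IPv4/IPv6 address.
        return False
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in CRAWLER_NETWORKS)
```

A request that claims the crawler's User-Agent but arrives from an address outside these ranges would fail this check, which is a reasonable cue to treat it as an impersonator.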

📋 robots.txt Compliance

According to the official documentation published at focusedcrawler.com/bot-policy, the bot fully honors Disallow directives and will not crawl any explicitly blocked path. It also respects Crawl-delay values and pauses between requests accordingly. However, it does not support the X-Robots-Tag response header for pages outside the root; only directives in the robots.txt file and noindex meta tags are obeyed.
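To illustrate how the stated Disallow and Crawl-delay behavior would apply to a given site, the sketch below evaluates a sample robots.txt with Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs, and the `focused_crawler` user-agent token are assumptions made for illustration; the token the bot actually matches should be confirmed against the operator's policy page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the "focused_crawler" token is assumed from the UA strings.
ROBOTS_TXT = """\
User-agent: focused_crawler
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler would skip the first URL and may fetch the second.
private_ok = parser.can_fetch("focused_crawler", "https://example.com/private/report.html")
blog_ok = parser.can_fetch("focused_crawler", "https://example.com/blog/post.html")

# Seconds a compliant crawler would wait between requests under this file.
delay = parser.crawl_delay("focused_crawler")
```

Under this sample file, `private_ok` is False, `blog_ok` is True, and `delay` is 5.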

🔍 Detection Indicators

The primary User‑Agent string is Mozilla/5.0 (compatible; focused_crawler/1.0; +https://focusedcrawler.com/bot). A secondary string exists: focused_crawler/2.0 (compatible; bot). The bot always sends a From header with the email address [email protected] and a Referer header set to https://focusedcrawler.com/crawl. Behavioral fingerprint: it requests robots.txt every 24 hours and never sends Cookie or Authorization headers.
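The indicators above can be combined into a simple header check. This is a rough sketch, not a vetted detection rule: the function names and the regular expression are hypothetical, and the logic only restates the signals listed in this section (User-Agent token, Referer value, presence of From, absence of Cookie and Authorization).

```python
import re

# Matches both documented forms, e.g. "focused_crawler/1.0" and "focused_crawler/2.0".
UA_PATTERN = re.compile(r"\bfocused_crawler/\d+(?:\.\d+)*", re.IGNORECASE)
EXPECTED_REFERER = "https://focusedcrawler.com/crawl"


def score_request(headers: dict) -> dict:
    """Report which of the documented header indicators a request matches."""
    h = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    return {
        "ua_match": bool(UA_PATTERN.search(h.get("user-agent", ""))),
        "referer_match": h.get("referer") == EXPECTED_REFERER,
        "has_from": "from" in h,
        "no_credentials": "cookie" not in h and "authorization" not in h,
    }


def looks_like_focused_crawler(headers: dict) -> bool:
    """True only when every documented indicator is present."""
    return all(score_request(headers).values())
```

Because any client can set these headers, a match is best read as a claim of identity rather than proof of it.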

📊 Data Usage

Data harvested by focused_crawler is processed through a deduplication pipeline and then used to train large‑scale language models, improve semantic search algorithms, and build industry‑specific knowledge graphs. FocusedCrawler Inc. also sells access to curated, anonymised subsets of the collected data under a commercial license. The company states that personally identifiable information (PII) is automatically stripped before storage.

⚙️ Rate Limiting Policy

Because focused_crawler can reach up to 15 requests per second during discovery phases and does not include backoff logic beyond robots.txt directives, web administrators should apply threshold-based rate limiting (e.g., block a single IP that exceeds 20 requests per 2-second window for more than 30 seconds). This policy is not punitive, since the bot is legitimate, but it is necessary to protect server resources and maintain application performance for real users.
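One way to read the example threshold is as a sliding-window counter with a sustain timer. The sketch below is an in-memory illustration under that interpretation; the class name `ThresholdLimiter` and its parameters are hypothetical, and a production deployment would more likely express the same rule in a reverse proxy or WAF.

```python
from collections import defaultdict, deque


class ThresholdLimiter:
    """Block an IP that stays above max_requests per window for longer than sustain seconds."""

    def __init__(self, max_requests: int = 20, window: float = 2.0, sustain: float = 30.0):
        self.max_requests = max_requests
        self.window = window
        self.sustain = sustain
        self._hits = defaultdict(deque)  # ip -> request timestamps inside the window
        self._over_since = {}            # ip -> time the rate first exceeded the threshold

    def should_block(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        hits.append(now)
        # Discard timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] > self.window:
            hits.popleft()
        if len(hits) > self.max_requests:
            # Remember when the excess started; block once it has lasted long enough.
            started = self._over_since.setdefault(ip, now)
            return now - started > self.sustain
        # Rate is back at or under the threshold, so reset the sustain timer.
        self._over_since.pop(ip, None)
        return False
```

With these defaults, the documented default rate of 8 requests per second (about 16 per window) never trips the rule, while a 15 requests-per-second burst (about 30 per window) is blocked only if it continues for more than 30 seconds.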

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.