ICC-Crawler

Crawler User-Agent: icc-crawler

🤖 Overview

ICC-Crawler is a legitimate web crawler operated by the DFINITY Foundation and the broader Internet Computer (IC) community. It indexes content hosted on the Internet Computer blockchain for ecosystem search engines such as ic.rocks and ICP Search. First publicly documented in 2021, it discovers and catalogs the growing set of canister-based websites and dapps (decentralized applications) running on the IC network so that users can find content through decentralized search tools.

🌐 Technical Behavior

ICC-Crawler sends HTTP GET requests to canister endpoints served through the IC boundary nodes, typically targeting URLs under *.ic0.app or *.raw.ic0.app. It does not execute JavaScript and fetches only static HTML, text, and metadata. Request intervals are configurable but generally follow a polite crawl delay of around 1–2 seconds per canister to avoid overwhelming individual smart contracts. According to IC community logs, requests originate from IP ranges belonging to major cloud providers (AWS, Google Cloud, Cloudflare), and the crawler may use multiple concurrent workers to scan newly deployed canisters. It supports HTTP/1.1 and sends a standard Accept header for text/html.

📋 robots.txt Compliance

According to official DFINITY documentation and the source code of open-source implementations (e.g., the ic-crawler repository on GitHub), ICC-Crawler respects standard robots.txt directives placed at the root of a canister’s domain. It parses Disallow rules and will not crawl paths or patterns that are explicitly prohibited. This compliance is verified by IC ecosystem operators who have observed the crawler honouring their exclusions.
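The exclusion behaviour described above can be previewed with Python's standard urllib.robotparser module. This is a minimal sketch, not the crawler's own parser: the robots.txt content, the /private/ and /drafts/ paths, and the helper name is_allowed are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a canister-hosted site; the paths are
# assumptions chosen only for illustration.
ROBOTS_TXT = """\
User-agent: icc-crawler
Disallow: /private/
Disallow: /drafts/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the given user agent may fetch the path under these rules."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/private/data.html"):
        print(path, is_allowed(ROBOTS_TXT, "ICC-Crawler/1.0", path))
```

The standard-library parser matches the user-agent token case-insensitively and ignores the version suffix, so a rule group for icc-crawler also applies to ICC-Crawler/1.0.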

🔍 Detection Indicators

The primary User-Agent string is ICC-Crawler/1.0 (or the variant ICC-Crawler/2.0), often accompanied by a From header containing an admin email address (e.g., [email protected]). In some implementations the crawler also identifies itself with an X-IC-Crawler header set to 1. It does not spoof standard browser UAs and can be distinguished by its consistent user‑agent pattern and by the fact that it never executes JavaScript.
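The indicators above could be checked against incoming request headers roughly as follows. This is a sketch under stated assumptions: the function name classify_request, the result keys, and the example address are hypothetical, and the regular expression only encodes the User-Agent forms listed in this section.

```python
import re
from typing import Mapping, Optional

# Matches "icc-crawler" with an optional "/<version>" suffix, case-insensitively.
UA_PATTERN = re.compile(r"\bicc-crawler(?:/(\d+(?:\.\d+)*))?", re.IGNORECASE)


def classify_request(headers: Mapping[str, str]) -> dict:
    """Report which of the documented ICC-Crawler indicators a request carries."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    match = UA_PATTERN.search(normalized.get("user-agent", ""))
    version: Optional[str] = match.group(1) if match else None
    return {
        "ua_match": match is not None,
        "version": version,
        "has_from": bool(normalized.get("from", "").strip()),
        "has_marker": normalized.get("x-ic-crawler", "").strip() == "1",
    }


if __name__ == "__main__":
    sample = {
        "User-Agent": "ICC-Crawler/2.0",
        "From": "ops@example.org",  # placeholder address, not a real contact
        "X-IC-Crawler": "1",
    }
    print(classify_request(sample))
```

Because all of these headers are supplied by the client, a match indicates only that a request claims to be ICC-Crawler; it does not prove where the request came from.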

📊 Data Usage

Collected data—including page titles, meta descriptions, visible text, and internal links—is aggregated into public search indexes that allow users to find IC‑native content without relying on centralised search engines. These indexes are used by projects like ic.rocks (a block explorer and search tool) and the ICP Search engine. No personally identifiable information is intentionally collected, and the data is served back to the community for non‑commercial indexing purposes.

⚙️ Rate Limiting Policy

Although ICC-Crawler is non‑malicious, it is rate‑limited to prevent excessive load on canisters that may have limited computational capacity. A threshold‑based policy (e.g., 10 requests per second per canister) ensures that no single canister is degraded; the rationale is that polite crawling protects the performance of decentralised applications for end users.
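A threshold-based policy of this kind can be sketched as a sliding-window counter keyed by canister. The class name PerCanisterRateLimiter, the one-second window, and the canister IDs are assumptions for illustration; only the figure of 10 requests per second per canister comes from the text above, and the actual enforcement mechanism is not documented here.

```python
from collections import defaultdict, deque


class PerCanisterRateLimiter:
    """Sliding-window limiter: at most max_requests per window per canister."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of accepted requests, kept separately for each canister.
        self._hits = defaultdict(deque)

    def allow(self, canister_id: str, now: float) -> bool:
        """Return True and record the request if it fits within the threshold."""
        hits = self._hits[canister_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerCanisterRateLimiter()
    # Twelve requests to the same hypothetical canister within 0.11 seconds.
    decisions = [limiter.allow("canister-a", i * 0.01) for i in range(12)]
    print(decisions)
```

Passing the timestamp in explicitly keeps the example deterministic; a live deployment would more likely read it from time.monotonic().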

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.