iecheck

Bot User-Agent: iecheck

🤖 Overview

iecheck is a legitimate web crawler operated by the Internet Archive as part of its Heritrix crawling infrastructure, where it verifies the integrity and availability of archived web pages. According to the Internet Archive’s official documentation and the Heritrix GitHub repository (https://github.com/internetarchive/heritrix3), iecheck is a verification bot that periodically re-checks previously archived URLs to confirm that they remain accessible and that the archived content still matches the live version. Its primary purpose is quality assurance and metadata validation for the Wayback Machine and other Internet Archive services.

🌐 Technical Behavior

iecheck performs targeted, single-URL requests rather than broad crawls. It typically requests one specific URL and compares the HTTP status code, response headers, and content checksum against the stored archival record. The bot uses HTTP/1.1 with conditional GET headers (If-Modified-Since, If-None-Match), so an unchanged page can be answered with a 304 Not Modified response instead of a full body, which minimizes server load. Requests are made from IP ranges assigned to the Internet Archive, which are publicly listed under their ASN (AS13693). According to the Internet Archive’s robot policy page, iecheck sends no more than one request per second per domain and respects robots.txt directives. The User-Agent string typically includes the crawler version, such as "iecheck/1.0". No aggressive parallelism or concurrent connections have been reported.
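To illustrate how conditional GET headers reduce load, here is a minimal server-side sketch of the decision to answer with 304 Not Modified. It is not taken from the Internet Archive's code. The function name `is_not_modified`, the plain-dict header lookup (assumed to use exact header names), and the ETag values are illustrative assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop a leading W/ so that weak and strong ETags compare equal."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True if a conditional GET can be answered with 304.

    request_headers: dict of header name -> value (hypothetical shape).
    etag: the resource's current ETag, including its quotes.
    last_modified: timezone-aware datetime of the last change.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = {_strip_weak(t.strip()) for t in if_none_match.split(",")}
        return "*" in candidates or _strip_weak(etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: treat the header as absent
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return last_modified.replace(microsecond=0) <= since

    return False
```

When the function returns True, the server can send a 304 with no body; otherwise it sends the full 200 response.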

📋 robots.txt Compliance

The Internet Archive explicitly states on their robots.txt information page (https://archive.org/details/archive.org-robots.txt) that iecheck honors Disallow directives. The bot is designed to obey standard robots exclusion rules, and the Internet Archive encourages site owners to use robots.txt to prevent crawling of specific paths. iecheck is documented to check robots.txt before making any request and to skip any URL that is disallowed.
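As a sketch of how a site owner might confirm that a robots.txt rule targets iecheck as intended, the example below evaluates a sample file with Python's standard `urllib.robotparser`. The robots.txt content, the `/private/` path, the `example.com` URLs, and the helper name `is_allowed` are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep iecheck out of /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: iecheck
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/private/page.html", "/public/page.html"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "iecheck/1.0", url))
```

This only shows how a compliant parser reads the rules. Whether a given crawler actually obeys them has to be confirmed from server logs.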

🔍 Detection Indicators

The primary User-Agent string is iecheck/1.0 or Mozilla/5.0 (compatible; iecheck/1.0; +https://archive.org/details/archive.org-bots). Additional identifying headers include From: [email protected] and an X-Archive-Originating-IP header that reflects the server’s IP. The bot does not spoof browser signatures and always identifies itself clearly. The Internet Archive publishes a list of all their bot User-Agents on their official bots page.
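A log filter for the User-Agent forms listed above could look like the following sketch. The regular expression and the function name `parse_iecheck_version` are assumptions made for illustration. A User-Agent match alone does not prove the request came from the Internet Archive, since any client can send this string.

```python
import re

# Match the "iecheck/<version>" token at the start of the string or after
# a space, semicolon, or opening parenthesis, so that longer names which
# merely end in "iecheck" are not matched.
IECHECK_UA = re.compile(r"(?:^|[\s;(])iecheck/(\d+(?:\.\d+)*)", re.IGNORECASE)


def parse_iecheck_version(user_agent):
    """Return the iecheck version in a User-Agent string, or None."""
    match = IECHECK_UA.search(user_agent or "")
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "iecheck/1.0",
        "Mozilla/5.0 (compatible; iecheck/1.0; "
        "+https://archive.org/details/archive.org-bots)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]
    for ua in samples:
        print(parse_iecheck_version(ua), "<-", ua)
```

Combining this check with a source-IP check against the operator's published ranges gives a stronger signal than the header alone.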

📊 Data Usage

Data collected by iecheck is used solely for internal quality assurance of the Internet Archive’s web archiving systems. The bot verifies that archived snapshots remain consistent with live pages, detects link rot, and reports HTTP status changes. No data is sold or used for AI training. The results feed into the Wayback Machine’s metadata system to flag broken or altered pages.

⚙️ Rate Limiting Policy

iecheck is rate-limited because its periodic verification requests can become frequent when a domain hosts many archived pages. Threshold-based blocking prevents excessive load on small or fragile websites while still allowing the Internet Archive to maintain the integrity of its collections. The documented standard rate limit is 1 request per second.
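One way to picture threshold-based limiting is a per-client sliding window, sketched below with the 1 request per second figure mentioned above as the default. The class name `SlidingWindowLimiter`, the keying by bot name, and the injectable clock are assumptions for illustration and do not describe any particular product's implementation.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within any window_seconds span."""

    def __init__(self, max_requests=1, window_seconds=1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # key -> timestamps of allowed hits

    def allow(self, key):
        """Record and allow the request, or return False if over the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    print(limiter.allow("iecheck"))  # first request in the window
    print(limiter.allow("iecheck"))  # immediate repeat is refused
```

A rejected request would typically receive a 429 Too Many Requests response, leaving the crawler free to retry later.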


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.