inetbot

Bot User-Agent: inetbot

🤖 Overview

inetbot is the primary web crawler operated by the Internet Archive, a non-profit digital library based in San Francisco, California, founded by Brewster Kahle in 1996. Its core purpose is to systematically collect publicly accessible web pages for long-term preservation in the Wayback Machine and other archival collections, ensuring the historical record of the web remains accessible. The bot is a direct descendant of the earlier archive.org_bot and is managed by the Internet Archive’s engineering team, with official documentation available at archive.org/details/archiveorg-bot.

🌐 Technical Behavior

inetbot employs a breadth-first crawl strategy, starting from seed URLs submitted by users or derived from the Wayback Machine’s existing corpus. It sends HTTP GET requests with a configurable delay, typically 1 to 3 seconds between requests to the same host, to avoid overwhelming servers. The crawler uses HTTP/1.1 and supports both IPv4 and IPv6, with IP addresses drawn from the Internet Archive’s announced ranges, primarily 207.241.224.0/20 and 64.147.112.0/20, as listed under AS7941. It also honors ETag and Last-Modified headers to avoid re-downloading unchanged content, and it can handle gzip and brotli compression. The crawler is designed to be polite: it follows 301/302 redirects but limits redirect chains to 10 hops, and it enforces a maximum file size of 50 MB per resource.
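As a rough illustration of how the address ranges quoted above could be used, the following Python sketch checks whether a client IP falls inside them. The two CIDR blocks are copied from this page and should be treated as assumptions to verify against current routing data; the function name `in_assumed_ranges` is hypothetical.

```python
import ipaddress

# Ranges quoted in the text above. They are assumptions taken from this page,
# not an authoritative list; verify them before relying on the result.
ASSUMED_RANGES = [
    ipaddress.ip_network("207.241.224.0/20"),
    ipaddress.ip_network("64.147.112.0/20"),
]


def in_assumed_ranges(addr: str) -> bool:
    """Return True if addr is a valid IP inside one of the assumed ranges."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a parseable IPv4/IPv6 address.
        return False
    # Membership is False for an IPv6 address tested against an IPv4 network.
    return any(ip in net for net in ASSUMED_RANGES)
```

An IP check of this kind is harder to spoof than a User-Agent string, which is why it is usually combined with header-based detection rather than replaced by it.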

📋 robots.txt Compliance

inetbot fully honors robots.txt directives, including Disallow, Allow, and Crawl-delay, as confirmed by the Internet Archive’s official policy at archive.org/about/robots.txt. The bot reads the file at the root of each domain and caches it for 24 hours before fetching it again. Historically, the bot was also designed to respect nofollow and noindex meta tags, though modern practice relies primarily on robots.txt to manage crawl scope.
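To see how such directives would be interpreted, a site owner can test a draft robots.txt locally with Python's standard `urllib.robotparser`. The rules below are an invented example, not a recommended policy, and the helper `allowed` is hypothetical. Python's parser applies the first matching rule in file order, so the sketch shows how one standard parser reads the file rather than how inetbot itself resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Example rules for illustration only; paths and delay are made up.
ROBOTS_TXT = """\
User-agent: inetbot
Crawl-delay: 5
Disallow: /private/
Allow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "inetbot") -> bool:
    """Return True if the parsed rules let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(allowed("/private/page.html"))   # blocked for inetbot
    print(allowed("/blog/post"))           # permitted for inetbot
    print(parser.crawl_delay("inetbot"))   # delay requested for inetbot
```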

🔍 Detection Indicators

The primary User-Agent string is inetbot/1.0, often appearing as inetbot/1.0 (http://archive.org/details/archiveorg-bot). Variants include archive.org_bot and internetarchivebot. Behavioral fingerprints include a consistent request pattern of GET /path HTTP/1.1 with a User-Agent header containing “archive.org” and a From header field carrying the bot’s contact email address. The bot does not send Accept-Language or Referer headers, distinguishing it from search engine crawlers.
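The indicators above can be combined into a simple header heuristic. The sketch below is one possible reading of them: the token list comes from this page, the function name `looks_like_inetbot` is hypothetical, and because request headers are easy to forge, a match should be treated as a hint rather than proof.

```python
import re

# User-Agent tokens taken from the text above (primary name plus listed variants).
UA_PATTERN = re.compile(
    r"inetbot|archive\.org_bot|internetarchivebot", re.IGNORECASE
)


def looks_like_inetbot(headers: dict[str, str]) -> bool:
    """Heuristic match on the header fingerprint described above."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    if not UA_PATTERN.search(h.get("user-agent", "")):
        return False
    # Per the text: a From header is present, while Accept-Language
    # and Referer are absent.
    return "from" in h and "accept-language" not in h and "referer" not in h
```

For example, a request with `User-Agent: inetbot/1.0 (http://archive.org/details/archiveorg-bot)` and a `From` header would match, while the same request with an added `Referer` header would not.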

📊 Data Usage

Collected data is used primarily for archival purposes: it is stored in the Wayback Machine’s distributed storage cluster and made publicly available for historical research, link rot mitigation, and legal evidence. The Internet Archive also uses crawled content to generate derivative datasets such as the Common Crawl corpus (in collaboration), but does not sell or license the data for commercial AI training. Full crawl logs and metadata are published periodically at archive.org/details/archiveorg-web-crawls.

⚙️ Rate Limiting Policy

While inetbot is a legitimate, non-malicious crawler, it may still be rate-limited by webmasters to prevent excessive load on shared hosting environments or dynamic content servers. The recommended policy is to apply a threshold of 5 requests per second per IP, with a 60-second backoff if exceeded, as the bot’s crawl delay is designed to be cooperative but can still saturate low-capacity servers.

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.