LCC

Bot User-Agent: lcc

🤖 Overview

The LCC crawler is operated by the Library of Congress as part of its Web Archiving Program, which began in 2000 under the direction of the National Digital Information Infrastructure and Preservation Program (NDIIPP). Its mission is to capture and preserve publicly accessible web content—including government documents, cultural heritage sites, and academic resources—for long-term historical and research purposes. The collected data feeds into the Library’s digital collections, accessible via the loc.gov web archive interface and partner repositories like the Internet Archive.

🌐 Technical Behavior

The LCC crawler is built on the open-source Heritrix web crawling framework, developed by the Internet Archive. It prioritizes .gov, .edu, and .mil domains but also traverses general web content relevant to the Library’s collecting scope. Crawl frequency is governed by a politeness policy that typically limits requests to 1–2 per second per domain, though burst rates can reach 5 per second during seed discovery. The crawler uses HTTP/1.1 with gzip compression and supports both HTTP and HTTPS. IP addresses for the crawler originate from the Library of Congress’s assigned blocks, primarily 140.147.249.0/24, as documented in the official Library of Congress network registry. The bot identifies itself via the `User-Agent` header and includes a `From` header referencing the Library’s contact address.
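As a rough illustration of how the address range mentioned above might be used on the receiving side, the Python sketch below checks whether a client IP falls inside 140.147.249.0/24. The constant and function names are hypothetical, and the range comes only from the description above, so it should be verified before being relied on.

```python
import ipaddress

# Range quoted in the text above; treat it as an assumption to verify,
# not as an authoritative allowlist.
DOCUMENTED_LCC_NETWORK = ipaddress.ip_network("140.147.249.0/24")


def in_documented_lcc_range(client_ip: str) -> bool:
    """Return True if client_ip parses and lies inside the documented block."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed input is never treated as a match.
        return False
    return address in DOCUMENTED_LCC_NETWORK
```

An IP match alone is weak evidence, so a check like this is probably best combined with the User-Agent and reverse DNS indicators.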

📋 robots.txt Compliance

The LCC crawler officially honors robots.txt directives, as stated in the Library of Congress Web Archiving technical documentation at loc.gov/webarchiving/technical. It observes the standard exclusion protocol, including `Disallow` rules, and also respects the non-standard but widely used `Crawl-delay` directive. However, as a preservation-oriented crawler, it may make narrow exceptions for sites that explicitly permit archiving (e.g., through `Allow` directives in robots.txt) or that are part of a mandated federal record collection.
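To make the directives above concrete, here is a minimal sketch of how a robots.txt group addressed to the `lcc` token would be interpreted, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.org` URLs, and the 2-second delay are illustrative assumptions, and the standard-library parser only approximates how the crawler itself evaluates rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: slow the LCC crawler down and keep it out of
# /private/, while leaving every other agent unrestricted.
EXAMPLE_ROBOTS_TXT = """\
User-agent: lcc
Crawl-delay: 2
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network request is made.
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

blocked = parser.can_fetch("lcc", "https://example.org/private/report.html")
allowed = parser.can_fetch("lcc", "https://example.org/public/index.html")
delay = parser.crawl_delay("lcc")
```

Under these assumptions, `blocked` is `False`, `allowed` is `True`, and `delay` is `2`.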

🔍 Detection Indicators

Primary User-Agent strings: `LCC (http://www.loc.gov/crawler/)` or `Mozilla/5.0 (compatible; LCC; http://www.loc.gov/crawler/)`. Additional identifying headers include a `From` field set to a Library of Congress contact email address and a `Referer` field often pointing to loc.gov seed lists. Reverse DNS lookups resolve to hostnames like crawler.loc.gov, which helps confirm the bot’s origin.
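The indicators above could be combined roughly as follows. This is a sketch, not an official detection rule: the regular expression and function names are hypothetical, and it assumes the reverse DNS hostname has already been resolved elsewhere (for example with `socket.gethostbyaddr`) and passed in as a string.

```python
import re

# Matches both documented User-Agent forms: the token "LCC" followed
# somewhere later by the loc.gov crawler URL.
LCC_USER_AGENT = re.compile(r"\bLCC\b.*loc\.gov/crawler", re.IGNORECASE)


def user_agent_claims_lcc(user_agent: str) -> bool:
    """True if the header looks like one of the documented LCC strings."""
    return bool(LCC_USER_AGENT.search(user_agent))


def hostname_is_loc_gov(hostname: str) -> bool:
    """True if a reverse DNS name is loc.gov or one of its subdomains."""
    host = hostname.lower().rstrip(".")
    return host == "loc.gov" or host.endswith(".loc.gov")


def looks_like_lcc(user_agent: str, reverse_dns_hostname: str) -> bool:
    """Require both signals, since a User-Agent header is trivial to spoof."""
    return user_agent_claims_lcc(user_agent) and hostname_is_loc_gov(
        reverse_dns_hostname
    )
```

Requiring the hostname check alongside the header is a deliberate choice here. A stricter setup would also forward-resolve the hostname and confirm it maps back to the connecting IP.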

📊 Data Usage

Collected data is stored in the Library of Congress’s web archive and made available to researchers, historians, and the public through a searchable interface similar to the Wayback Machine. The archive is used for scholarly analysis, government transparency, and cultural preservation. Data is not used for commercial AI training, advertising, or any monetized purpose—its sole function is archival stewardship.

⚙️ Rate Limiting Policy

The LCC crawler is rate-limited by webmasters primarily because its broad crawl scope and persistent re-crawling of millions of URLs can strain server resources. Threshold-based blocking is a reasonable defense against aggressive archiving schedules, allowing servers to prioritize human traffic while still accommodating the bot’s legitimate preservation mission at a controlled pace.
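One way to picture threshold-based limiting is a sliding-window counter keyed by client. The sketch below is illustrative only: the class name is hypothetical, and the threshold and window are parameters a site operator would tune.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Per-client queue of timestamps for recently allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        """Record a request at time `now`; return False if over the limit."""
        hits = self._hits[client_key]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In a real deployment `now` would typically come from `time.monotonic()`, and a rejected request would usually be answered with HTTP 429 rather than silently dropped, so a well-behaved crawler can back off.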


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.