nlese Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: nlese

🤖 Overview

The NLESE crawler (National Library of Estonia Search Engine) is operated by the National Library of Estonia (Eesti Rahvusraamatukogu) as part of its web archiving initiative, which is officially documented at nlib.ee/en/web-archive. Its primary purpose is to systematically collect and preserve publicly accessible Estonian web content for long-term cultural heritage storage. The collected data feeds the Estonian Web Archive (veebiarhiiv.ee), a legal deposit system mandated by the Estonian Deposit Act.

🌐 Technical Behavior

NLESE employs a breadth-first crawl strategy, starting from a curated seed list of .ee domains and Estonian-language sites. The default crawl depth is 10 levels, with a maximum of 50,000 URLs per host per crawl cycle. Its crawler software (a modified Heritrix 3.x) is configured with a minimum crawl delay of 5 seconds between requests to the same domain, but it may open up to 8 concurrent connections per host. The bot primarily uses HTTP/1.1 with gzip compression and sends conditional GET requests with If-Modified-Since and If-None-Match headers (the latter carrying a previously received ETag) to reduce bandwidth. IP ranges are allocated from the Estonian academic network (AS20671, EENet) and fall within 193.40.0.0/16, with additional IPv6 addresses from 2001:7d0::/32 as per RIPE NCC records. It does not execute JavaScript or load external resources beyond the initial HTML and CSS.
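As an illustration of how the address ranges above might be checked, here is a minimal Python sketch using the standard ipaddress module. The two networks are copied from the text and have not been independently verified, so confirm them against current RIPE NCC records before relying on them. The name in_nlese_range is a hypothetical helper, not part of any published tool.

```python
import ipaddress

# Ranges as quoted in the text above. They are unverified here:
# check current RIPE NCC records before using them in production.
NLESE_NETWORKS = [
    ipaddress.ip_network("193.40.0.0/16"),
    ipaddress.ip_network("2001:7d0::/32"),
]


def in_nlese_range(addr: str) -> bool:
    """Return True if addr is a valid IP inside one of the listed networks."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a parseable IPv4/IPv6 address.
        return False
    # Compare only against networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in NLESE_NETWORKS)
```

For example, in_nlese_range("193.40.5.1") is true while in_nlese_range("193.41.0.1") is false. A match only shows that the request came from the listed address space, not that it came from the crawler itself.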

📋 robots.txt Compliance

Official documentation from the National Library of Estonia states that NLESE honors robots.txt Disallow and Crawl-delay directives, with a stated policy of skipping any path that is explicitly disallowed. There is one exception: because the archive’s mandate covers publicly available content, the crawler may ignore robots.txt on sites within the legal deposit scope when the file attempts to block all crawlers. This exception is published in the ROBOTS_POLICY.md file of their GitHub repository (github.com/nlib-eesti/nlese-crawler). For non-deposit sites, compliance is strict.
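To see how Disallow and Crawl-delay rules aimed at this bot would be interpreted, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser. The robots.txt content is purely illustrative: the /private/ path and the 10-second delay are made up, and using nlese as the user-agent token is an assumption based on the User-Agent name listed for the bot. It shows how a standards-following parser reads the rules, not how NLESE itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt. The path and delay are invented for the example,
# and the "nlese" token is assumed from the bot's User-Agent name.
ROBOTS_TXT = """\
User-agent: nlese
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "nlese") -> bool:
    """Return True if the given agent may fetch path under ROBOTS_TXT."""
    return parser.can_fetch(agent, path)
```

With these rules, allowed("/private/report.html") is false for nlese, other agents fall through to the permissive wildcard group, and parser.crawl_delay("nlese") reports the requested 10-second delay.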

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; NLESE; +https://nlib.ee/en/web-archive/nlese), which may carry a version suffix such as NLESE/1.2. Older deployments use a secondary string, NLESE/1.0 (crawler; https://veebiarhiiv.ee). Behavioral fingerprints include a consistent pattern of a HEAD request followed by a GET, a 5-second interval between host switches, and the absence of Referer headers. Since 2023 the bot has also sent a custom header, X-NLESE-Crawl: true, in all requests, as verified in the official code repository.
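The header-based indicators above can be turned into a simple check. The following is a minimal sketch, assuming request headers are available as a plain dictionary. The function name looks_like_nlese and the exact regular expression are illustrative choices rather than anything published by the crawler's operator.

```python
import re

# Matches "NLESE" as a standalone token, optionally followed by a version
# such as "/1.2". The lookarounds avoid hits inside longer words.
NLESE_UA = re.compile(
    r"(?<![A-Za-z0-9])NLESE(?:/\d+(?:\.\d+)*)?(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def looks_like_nlese(headers: dict) -> bool:
    """Return True if the headers carry either NLESE indicator from the text."""
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {str(k).lower(): str(v) for k, v in headers.items()}
    ua_match = NLESE_UA.search(lowered.get("user-agent", "")) is not None
    header_match = lowered.get("x-nlese-crawl", "").strip().lower() == "true"
    return ua_match or header_match
```

Both the User-Agent and custom headers are set by the client and can be spoofed, so a match is best treated as a hint to be combined with other signals rather than as proof of identity.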

📊 Data Usage

Collected data is used exclusively for national digital preservation and scholarly research, not for AI/ML training, advertising, or commercial indexing. Copyrighted material older than 5 years is accessible only on the National Library’s premises, while newer content is embargoed. Researchers can request access via the library’s reading room, and metadata is shared with the European Archive, as stated in the Estonian Web Archive policy (veebiarhiiv.ee/terms). No derivative datasets are sold or licensed.

⚙️ Rate Limiting Policy

The recommended policy for NLESE is rate limiting, because of its aggressive concurrent connection count (up to 8 per host) and its potential for high request volume during large-scale archival sweeps. Webmasters should implement threshold-based blocking (e.g., more than 50 requests in 60 seconds from the same IP range) to protect server resources, while still letting the bot fulfil its legal deposit role at the pace set by Crawl-delay directives.
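The threshold rule above (more than 50 requests in 60 seconds from the same IP range) can be sketched as a sliding-window counter. This is a minimal in-memory illustration, not a production limiter. The class name RangeRateLimiter is hypothetical, and grouping addresses by /24 (IPv4) and /64 (IPv6) is an assumption, since the text does not say how wide an "IP range" should be.

```python
import ipaddress
import time
from collections import defaultdict, deque
from typing import Optional


class RangeRateLimiter:
    """Sliding-window limiter keyed by network prefix (assumed /24 and /64)."""

    def __init__(self, max_requests: int = 50, window_seconds: float = 60.0,
                 v4_prefix: int = 24, v6_prefix: int = 64) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.v4_prefix = v4_prefix
        self.v6_prefix = v6_prefix
        self._hits = defaultdict(deque)  # network -> request timestamps

    def _key(self, addr: str):
        ip = ipaddress.ip_address(addr)
        prefix = self.v4_prefix if ip.version == 4 else self.v6_prefix
        # strict=False masks the host bits, yielding the enclosing network.
        return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)

    def allow(self, addr: str, now: Optional[float] = None) -> bool:
        """Record a request and return False once the range exceeds the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[self._key(addr)]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        # Rejected requests are recorded too, so sustained bursts stay blocked.
        hits.append(now)
        return len(hits) <= self.max_requests
```

A caller would create one limiter and call allow(client_ip) per request, answering with HTTP 429 when it returns False. Passing now explicitly is only there to make the behavior easy to test.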

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
