crawler_for_infomine

Crawler User-Agent: crawler-for-infomine

🤖 Overview

crawler_for_infomine is a web crawler operated by the University of California, Riverside as part of the Infomine project, a scholarly virtual library that aggregates and indexes academic and government web resources. First deployed in the late 1990s, its primary purpose is to harvest metadata and full-text content from .edu, .gov, and .org domains to populate the Infomine searchable database, which serves researchers and educators worldwide. The crawler is documented on the official Infomine website (infomine.ucr.edu) and in the project's technical papers, including "Infomine: A Web-Based Virtual Library of Internet Resources" (1994, UCR Libraries).

🌐 Technical Behavior

The crawler follows a breadth-first crawl strategy, starting from seed URLs curated by Infomine librarians. It defaults to a maximum rate of one request every 10 seconds per domain, as specified in the project’s operational guidelines, and request frequency is throttled dynamically based on server response times. Requests come from IPv4 addresses in the University of California, Riverside’s 138.23.0.0/16 block (a former class B network); no IPv6 range has been publicly assigned. The crawler issues only HTTP/1.1 GET requests and does not render JavaScript or track sessions with cookies. According to the Infomine administrative pages, it processes only HTML pages and PDF documents and ignores images, videos, and archives. It identifies itself with the User-Agent string Mozilla/4.0 (compatible; crawler_for_infomine) and includes a contact email header: X-Contact: [email protected].
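The per-domain pacing described above can be sketched in Python as follows. This is an illustration only: the class name DomainThrottle, the slowdown_factor parameter, and the rule of stretching the delay in proportion to the last response time are assumptions, since the source does not say how Infomine adjusts its rate.

```python
from urllib.parse import urlsplit


class DomainThrottle:
    """Hypothetical per-domain pacing: a 10 s floor, stretched for slow servers."""

    def __init__(self, base_delay=10.0, slowdown_factor=2.0):
        self.base_delay = base_delay            # default from the text: 1 request / 10 s
        self.slowdown_factor = slowdown_factor  # assumed multiplier, not documented
        self._next_allowed = {}                 # domain -> earliest next fetch time

    def wait_time(self, url, now):
        """Seconds to wait before `url` may be fetched at time `now`."""
        domain = urlsplit(url).hostname
        return max(0.0, self._next_allowed.get(domain, 0.0) - now)

    def record(self, url, finished_at, response_seconds):
        """Schedule the next fetch after a response that took `response_seconds`."""
        domain = urlsplit(url).hostname
        delay = max(self.base_delay, self.slowdown_factor * response_seconds)
        self._next_allowed[domain] = finished_at + delay
```

Timestamps are passed in rather than read from a clock, so the same object can be driven by a real scheduler or replayed against log data.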

📋 robots.txt Compliance

The Infomine crawler fully respects robots.txt directives, as confirmed by the project’s official documentation on the Infomine website, which states "Our crawler honors the Robots Exclusion Protocol." It checks each target domain’s robots.txt file before crawling and obeys both Disallow and Crawl-delay directives. There are no documented instances of violations or non-compliance in published security forums or CVEs. The crawler also supports the optional Allow directive in robots.txt.
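For site operators who want to check how a compliant crawler would read their rules, Python's standard urllib.robotparser can evaluate a robots.txt file offline. The sketch below is illustrative: the robots.txt content, the 30-second Crawl-delay, the example.edu URLs, and the choice of crawler_for_infomine as the robots.txt token are all assumptions, not values taken from Infomine's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants to slow the crawler
# down and keep it out of /private/.
ROBOTS_TXT = """\
User-agent: crawler_for_infomine
Crawl-delay: 30
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "crawler_for_infomine"  # assumed token; the crawler's actual token may differ
print(parser.can_fetch(AGENT, "https://example.edu/private/report.pdf"))  # False
print(parser.can_fetch(AGENT, "https://example.edu/papers/index.html"))   # True
print(parser.crawl_delay(AGENT))                                          # 30
```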

🔍 Detection Indicators

Primary identification is the User-Agent string: Mozilla/4.0 (compatible; crawler_for_infomine). A secondary agent string reported in web server logs is Infomine/1.0, used for deeper indexing of PDFs. Behavioral fingerprints include sequential request patterns with no inter-page randomization and consistent use of the Accept: text/html,application/pdf header. Older versions of the crawler typically send a From: [email protected] header. Requests originate exclusively from UCR’s IP range 138.23.0.0/16.
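The two strongest indicators above, the User-Agent tokens and the source IP range, can be combined in a simple log check. The following is a minimal sketch: the function name classify_request and its three labels are hypothetical, and a matching IP and User-Agent suggest, but do not prove, that a request came from the Infomine crawler.

```python
import ipaddress

# Values taken from the indicators listed above.
UCR_RANGE = ipaddress.ip_network("138.23.0.0/16")
UA_TOKENS = ("crawler_for_infomine", "infomine/1.0")


def classify_request(remote_ip, user_agent):
    """Return 'likely-infomine', 'suspect', or 'other' for one log entry."""
    claims_infomine = any(token in user_agent.lower() for token in UA_TOKENS)
    if not claims_infomine:
        return "other"
    # A request that claims the Infomine User-Agent from outside the UCR
    # range does not match the documented behavior and is flagged.
    in_range = ipaddress.ip_address(remote_ip) in UCR_RANGE
    return "likely-infomine" if in_range else "suspect"
```

For example, classify_request("203.0.113.9", "Mozilla/4.0 (compatible; crawler_for_infomine)") returns "suspect" because the address lies outside 138.23.0.0/16.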

📊 Data Usage

Collected data feeds into the Infomine scholarly search engine, which provides metadata records—titles, descriptions, subjects, and URLs—for academic resources. Content is not used for AI training; it is solely for facilitating discovery of publicly available research materials, government reports, and educational websites. The database is updated monthly, and older snapshots are archived for longitudinal research. The project is maintained by the UCR Libraries and is publicly accessible at infomine.ucr.edu.

⚙️ Rate Limiting Policy

Rate limiting is appropriate for this bot because, although it is legitimate and crawls slowly (one request every 10 seconds, or about 6 per minute per domain), sequential requests for many subpages may still overwhelm shared hosting environments. A threshold of 50 requests per minute per IP from the UCR range is recommended; this sits well above the crawler's nominal rate, so it prevents excessive load without blocking the crawler entirely. This policy aligns with standard recommendations for educational bots.
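One way to enforce a per-IP threshold like this is a sliding-window counter. The sketch below assumes the limit of 50 requests per 60 seconds from the text; the class name SlidingWindowLimiter and its in-memory storage are hypothetical and would need replacing with shared state in a multi-process deployment.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within any `window_seconds` span."""

    def __init__(self, limit=50, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now):
        """Return True and record the request if `ip` is under the limit at `now`."""
        hits = self._hits[ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A request handler would call allow(remote_ip, time.monotonic()) and answer with HTTP 429 when it returns False.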

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.