earth science educator robot Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: earth-science-educator-robot

🤖 Overview

The earth science educator robot is a legitimate web crawler operated by the Earth Science Educator (ESE) community, a collective of academic researchers and educators focused on geoscience education. It aggregates publicly available educational content from university departments, geological surveys, and science outreach websites to build a centralized index of teaching resources. The crawled data feeds the ESE Open Educational Resource (OER) database, hosted at https://earthscienceeducator.org, which is referenced in community forums and in the GitHub repository https://github.com/earth-science-educator/crawler. The project's documentation describes it as a non-commercial, nonprofit effort.

🌐 Technical Behavior

The crawler is rate limited, with a default crawl delay of 10 seconds between requests, as specified in its source code on GitHub. It uses HTTP/1.1 over both plain HTTP and HTTPS connections and fetches only HTML pages and linked PDF documents. Its IP ranges belong to the ASN of the University of California (the project is hosted on UCSD servers), and it also uses AWS EC2 instances in the us-east-1 and us-west-2 regions for distributed crawling. Requests are made between 08:00 UTC and 22:00 UTC to minimize server impact. The crawler respects robots.txt by default and additionally applies an internal disallow list built from community member reports.
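
Operators who want to check whether traffic claiming to be this bot matches the schedule described above can compare access-log timestamps against the stated crawl window and delay. The following is a minimal sketch, not part of the crawler's code: the function names are hypothetical, the 22:00 UTC boundary is assumed to be exclusive, and the timestamps are assumed to be timezone-aware values for a single client hitting a single domain.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the behavior described above.
CRAWL_START_HOUR = 8    # 08:00 UTC
CRAWL_END_HOUR = 22     # 22:00 UTC, treated as exclusive (assumption)
DEFAULT_DELAY = timedelta(seconds=10)


def in_crawl_window(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the crawl window."""
    utc = ts.astimezone(timezone.utc)
    return CRAWL_START_HOUR <= utc.hour < CRAWL_END_HOUR


def schedule_violations(timestamps, min_gap=DEFAULT_DELAY):
    """List requests that fall outside the window or arrive too soon after the previous one."""
    issues = []
    ordered = sorted(timestamps)
    for i, ts in enumerate(ordered):
        if not in_crawl_window(ts):
            issues.append(f"{ts.isoformat()} is outside 08:00-22:00 UTC")
        if i > 0 and ts - ordered[i - 1] < min_gap:
            gap = (ts - ordered[i - 1]).total_seconds()
            issues.append(f"{ts.isoformat()} arrived {gap:.0f}s after the previous request")
    return issues
```

A non-empty result does not prove the traffic is spoofed, but it does mean the requests differ from the documented behavior and deserve a closer look.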

📋 robots.txt Compliance

According to the official GitHub documentation (commit a7f3e9b), the bot fully honors the Disallow and Allow directives in robots.txt and also checks for Crawl-delay directives. However, a 2023 bug report on the GitHub issues page notes that, during the aggressive indexing of initial dataset builds, the bot has ignored Disallow rules for pages that return a 404 error after a redirect. This is considered a minor compliance lapse rather than malicious behavior.
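
Before publishing rules aimed at this bot, it can help to preview offline how they would be interpreted. The sketch below uses Python's standard urllib.robotparser as a stand-in; it is not the crawler's own parser, and the robots.txt content and paths are hypothetical. Python's parser applies rules in file order, so the more specific Allow line is placed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths are not from any real site.
ROBOTS_TXT = """\
User-agent: earth-science-educator-robot
Crawl-delay: 30
Allow: /courses/public/
Disallow: /courses/
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""

BOT_TOKEN = "earth-science-educator-robot"


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Preview how each path would be treated for the bot's user-agent token.
for path in ("/index.html", "/courses/public/intro.html",
             "/courses/private/exam.pdf", "/drafts/notes.html"):
    verdict = "allowed" if parser.can_fetch(BOT_TOKEN, path) else "disallowed"
    print(f"{path}: {verdict}")

print("crawl delay:", parser.crawl_delay(BOT_TOKEN))
```

With these example rules, only the two paths under /courses/private/ and /drafts/ are reported as disallowed, and the crawl delay for the bot is 30 seconds.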

🔍 Detection Indicators

The bot identifies itself with the User-Agent string earth-science-educator-robot/1.0 (+https://earthscienceeducator.org/about-crawler). Additional fingerprints include a custom HTTP header, X-ESE-Crawler: true, and a concurrency limit of two simultaneous connections per domain. Its IP addresses are registered under the University of California (AS 32787) and AWS (AS 16509). The crawl speed shows no variation, and every request carries the header Accept: text/html,application/pdf.
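
The header-based indicators above can be combined into a simple match score. This is a minimal sketch under stated assumptions: the function name and scoring scheme are hypothetical, and because request headers are trivial to spoof, a high score only shows that a request claims to be this bot. Confirming the source network still requires a separate IP or ASN lookup, which is not shown here.

```python
import re

# Matches the documented User-Agent prefix followed by a version number.
UA_PATTERN = re.compile(r"^earth-science-educator-robot/\d+(\.\d+)*\b", re.IGNORECASE)
EXPECTED_ACCEPT = "text/html,application/pdf"


def indicator_score(headers: dict) -> int:
    """Count how many of the three header indicators a request matches (0 to 3)."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    score = 0
    if UA_PATTERN.search(normalized.get("user-agent", "")):
        score += 1
    if normalized.get("x-ese-crawler", "").lower() == "true":
        score += 1
    if normalized.get("accept", "").replace(" ", "") == EXPECTED_ACCEPT:
        score += 1
    return score
```

A request with all three indicators scores 3, while a typical browser request scores 0. Partial scores may point to a different crawler version or to an imitation.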

📊 Data Usage

Collected content is used exclusively for educational indexing and OER metadata extraction to populate the ESE database. The data supports AI-assisted lesson plan generation for K-12 and undergraduate geoscience courses. No data is sold or used for commercial AI training. A 2022 paper from the Journal of Geoscience Education (DOI: 10.1080/10899995.2022.2124678) describes the bot’s data pipeline for creating searchable lesson plans.

⚙️ Rate Limiting Policy

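Rate limiting is recommended because the bot can generate up to 50 requests per minute during its initial crawl of a new domain, well above the 6 requests per minute implied by its default 10-second crawl delay, and this may strain small educational servers. Threshold-based blocking (e.g., more than 100 requests per minute sustained for 5 minutes) is appropriate to prevent resource exhaustion while still allowing the beneficial indexing to proceed. Because that example threshold sits above the stated 50-request peak, it acts as a safety net against runaway behavior rather than a cap on normal crawling.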
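
The threshold rule above (more than 100 requests per minute for 5 minutes) can be sketched with per-minute counters. This is an illustrative, in-memory example rather than a production rate limiter: the class and method names are hypothetical, and "for 5 minutes" is interpreted here as five consecutive calendar-minute buckets that each exceed the limit.

```python
from collections import defaultdict


class SustainedRateDetector:
    """Flag a client once every one of the last N minute buckets exceeds the limit."""

    def __init__(self, limit_per_minute: int = 100, sustained_minutes: int = 5):
        self.limit = limit_per_minute
        self.window = sustained_minutes
        # client id -> {minute index -> request count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, client: str, timestamp: float) -> bool:
        """Record one request (Unix seconds) and return True if the client should be blocked."""
        minute = int(timestamp // 60)
        buckets = self.counts[client]
        buckets[minute] += 1
        # Discard buckets that have fallen out of the observation window.
        for old in [m for m in buckets if m <= minute - self.window]:
            del buckets[old]
        recent = [buckets.get(m, 0) for m in range(minute - self.window + 1, minute + 1)]
        return all(count > self.limit for count in recent)
```

In use, `record` would be called once per incoming request, keyed by client IP or by User-Agent. A crawl at the stated peak of 50 requests per minute never trips the default threshold.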

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.