Earth Platform Indexer Bot: Detection, Blocking & Technical Analysis

Indexer User-Agent: earth-platform-indexer

🤖 Overview

The Earth Platform Indexer is a legitimate web crawler operated by Earth Platform Inc. (earthplatform.com), first documented in their public crawler policy in 2022. Its purpose is to systematically collect publicly available geographic, environmental, and climate-related web content—including satellite imagery metadata, weather data APIs, and research articles—to feed into the Earth Platform’s geospatial intelligence and AI-powered environmental monitoring product.

🌐 Technical Behavior

The indexer uses a breadth-first crawl strategy. Its reported average rate is one request every 5–10 seconds per domain, although the default crawl delay documented in the official GitHub repository (github.com/earthplatform/crawler) is 20 seconds. It fetches HTTP and HTTPS endpoints using GET and HEAD requests and follows redirects up to 5 hops. Requests typically originate from Amazon Web Services (AWS) EC2 addresses in the us-east-1 region, within broad blocks such as 52.0.0.0/8 and 54.0.0.0/8; these /8 blocks cover far more than a single region, so they are useful only as a coarse signal. The bot uses IPv4 exclusively, and its technical documentation does not confirm any IPv6 support. It identifies itself with the User-Agent string EarthPlatformIndexer/1.0 (+https://earthplatform.com/crawler) and sends no custom headers beyond standard HTTP fields.
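As a rough illustration of the source-address signal, the sketch below checks whether a client address falls inside the two example blocks quoted above. The names EXAMPLE_RANGES and in_example_ranges are hypothetical, and the ranges are the illustrative ones from this page, not an authoritative list of the crawler's addresses, so a match should be treated as a weak hint at most.

```python
import ipaddress

# Example blocks quoted in the text; illustrative only, not an
# authoritative list of the crawler's source addresses.
EXAMPLE_RANGES = [
    ipaddress.ip_network("52.0.0.0/8"),
    ipaddress.ip_network("54.0.0.0/8"),
]


def in_example_ranges(remote_addr: str) -> bool:
    """Return True if remote_addr is an IPv4 address inside an example block."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address at all.
        return False
    if addr.version != 4:
        # The bot is described as IPv4-only, so IPv6 clients never match.
        return False
    return any(addr in net for net in EXAMPLE_RANGES)
```

Because these blocks are so wide, this check is probably best combined with the User-Agent and header indicators rather than used alone for blocking.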

📋 robots.txt Compliance

According to Earth Platform’s public statement at https://earthplatform.com/robots, the indexer honors all Disallow directives in robots.txt files, including wildcard patterns and path-specific exclusions. It also supports the Crawl-delay directive, waiting for the specified time between requests. No robots.txt violations have been documented in public security advisories or CVE entries.
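To see how such rules would be interpreted, the sketch below feeds a sample robots.txt to Python's standard urllib.robotparser module. The file contents, the example.com URLs, and the 30-second delay are assumptions made for illustration, and the standard-library parser is a stand-in for the crawler's own parser, which may behave differently (for example, it does not implement wildcard patterns).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that wants to slow the indexer
# down and keep it out of one directory.
ROBOTS_TXT = """\
User-agent: EarthPlatformIndexer
Crawl-delay: 30
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "EarthPlatformIndexer"

# Paths under /private/ are excluded; everything else stays crawlable.
private_allowed = parser.can_fetch(AGENT, "https://example.com/private/data.html")
public_allowed = parser.can_fetch(AGENT, "https://example.com/public/index.html")

# Delay in seconds requested for this agent, or None if not set.
delay = parser.crawl_delay(AGENT)
```

Here private_allowed is False, public_allowed is True, and delay is 30.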

🔍 Detection Indicators

Primary detection relies on the User-Agent string EarthPlatformIndexer/1.0, which includes a reference URL for verification. Behavioral fingerprints include a consistent request interval of 20 seconds (if no Crawl-delay is set) and an absence of query parameters in URLs other than those required for pagination. The bot always sends an Accept: text/html,application/xhtml+xml header and never includes a Referer header.
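The header indicators above can be combined into a single check. The following is a minimal sketch: the function name matches_indexer_profile is hypothetical, and the expected values are taken from the description on this page rather than from verified traffic. Since headers are trivial to spoof, a match suggests the request claims to be the indexer, not that it is.

```python
from typing import Mapping

UA_TOKEN = "EarthPlatformIndexer/"
EXPECTED_ACCEPT = "text/html,application/xhtml+xml"


def matches_indexer_profile(headers: Mapping[str, str]) -> bool:
    """Return True if the request headers fit the indexer's described profile."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    user_agent = normalized.get("user-agent", "")
    # Ignore optional whitespace after the comma in the Accept value.
    accept = normalized.get("accept", "").replace(" ", "")

    return (
        UA_TOKEN in user_agent
        and accept == EXPECTED_ACCEPT
        and "referer" not in normalized  # the bot is said never to send Referer
    )
```

A request that carries the User-Agent token but fails the other two conditions may be worth logging separately, since it could indicate another client imitating the indexer.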

📊 Data Usage

Collected data is used exclusively for the Earth Platform’s geospatial analytics product, which builds AI models for land-use classification, climate risk assessment, and environmental change detection. According to their privacy policy (earthplatform.com/privacy), crawled content is not used to train general-purpose language models; instead, it feeds specialized computer vision and spatiotemporal forecasting models. All data is stored in encrypted S3 buckets with access controls.

⚙️ Rate Limiting Policy

Rate limiting is recommended because the indexer, while compliant, can generate large request volumes when crawling large datasets (e.g., 10,000+ pages per hour, or roughly 167 requests per minute). A threshold-based block at 100 requests per minute per IP is recommended to prevent server overload. That threshold sits well above the roughly 3 requests per minute per domain implied by the bot’s documented 20-second crawl delay, so a crawler that honors the delay and Earth Platform’s official guidance of at most 2 concurrent connections per domain should not trigger it.
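One way to apply the 100-requests-per-minute threshold is a per-IP sliding window. The sketch below is an in-memory illustration only: the class name SlidingWindowLimiter and its parameters are hypothetical, and a production setup would more likely rely on the rate limiting built into a web server, CDN, or WAF.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP within window_seconds."""

    def __init__(self, max_requests: int = 100, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of accepted requests, oldest first, keyed by client IP.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record a request from ip and return False if it exceeds the limit."""
        if now is None:
            now = time.monotonic()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Calling limiter.allow(client_ip) on each request and returning HTTP 429 when it yields False would implement the threshold; the optional now argument exists only to make the behavior easy to test.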

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.