Internet Ninja

Bot User-Agent: internet-ninja

🤖 Overview

Internet Ninja is a legitimate web crawler operated by Internet Ninja LLC, a data-services company specializing in high-quality training data for large language models and generative AI products. First publicly documented in early 2024, the bot systematically collects publicly accessible web content, including text, structured data, and metadata, to feed the company's proprietary AI-training pipeline, known as the NinjaML platform. The official Internet Ninja Bot FAQ and the project's GitHub repository (github.com/internet-ninja/bot) describe the crawler as designed for transparency and compliance with web standards, and state explicitly that it does not bypass authentication or access restricted resources.

🌐 Technical Behavior

The bot is described as using a custom HTTP client built on Python's aiohttp library and as defaulting to HTTP/2 for faster, multiplexed requests; since aiohttp itself does not implement HTTP/2, one of these two claims is likely inaccurate. It sends conditional GET requests (If-Modified-Since, and If-None-Match carrying a previously received ETag) to reduce server load and avoid re-fetching unchanged content. Crawling is breadth-first with a configurable delay, typically 5 to 10 seconds between requests to the same domain, though bursts of up to 50 requests per second have been observed during initial deep crawls of large sites. Requests originate from AWS and Google Cloud IP ranges (regions us-east-1, eu-west-1, ap-southeast-1), which are documented in the public IP list at internet-ninja.com/ip-ranges. The bot sends a custom X-Internet-Ninja-Crawl header set to true and places its identity URL in the From header, although RFC 7231 defines From as carrying an email address. Each request is logged with a unique Crawl-ID for debugging.
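To illustrate the conditional GET behavior mentioned above, here is a minimal server-side sketch of how an origin might decide between 304 Not Modified and a full 200 response. The function name `conditional_status` and its parameters are hypothetical and not taken from Internet Ninja's documentation; the sketch only follows the general HTTP rule that If-None-Match takes precedence over If-Modified-Since.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag: str) -> str:
    """Drop a weak-validator prefix so W/"abc" compares equal to "abc"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy is still current, else 200.

    Hypothetical helper: `etag` is the resource's current quoted ETag and
    `last_modified` is a timezone-aware datetime.
    """
    headers = {name.lower(): value for name, value in request_headers.items()}

    if_none_match = headers.get("if-none-match")
    if if_none_match is not None:
        # If-None-Match wins over If-Modified-Since when both are present.
        candidates = {_strip_weak(t.strip()) for t in if_none_match.split(",")}
        if "*" in candidates or _strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = headers.get("if-modified-since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # Unparseable date: ignore the header.
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        if last_modified.replace(microsecond=0) <= since:
            return 304
    return 200
```

A server that answers these requests with 304 lets a well-behaved crawler skip the response body entirely, which is where the load reduction comes from.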

📋 robots.txt Compliance

According to the official robots.txt policy page, the Internet Ninja bot respects all Disallow, Allow, and Crawl-delay directives, including rule groups addressed to its specific user agent. If a request returns a 403 or 404, the bot immediately stops requesting that path. For transparency, the company has also published a list of URL patterns the bot ignores (e.g., hash-fragment URLs). Reports on multiple webmaster forums indicate that the bot honors custom Crawl-delay values of up to 60 seconds.
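As a rough illustration of the directives mentioned above, the following sketch uses Python's standard `urllib.robotparser` to check how a compliant crawler would interpret a sample robots.txt. The product token `internet-ninja`, the paths, and the 30-second delay are assumptions chosen for the example, not values confirmed by the operator.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: the token "internet-ninja" and the paths are assumptions.
ROBOTS_TXT = """\
User-agent: internet-ninja
Disallow: /private/
Crawl-delay: 30

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler would skip /private/ and wait 30 seconds between requests.
blocked = parser.can_fetch("internet-ninja", "https://example.com/private/report")
allowed = parser.can_fetch("internet-ninja", "https://example.com/blog/")
delay = parser.crawl_delay("internet-ninja")
```

Note that Crawl-delay is a widely used extension rather than part of the core Robots Exclusion Protocol, so support for it varies between crawlers.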

🔍 Detection Indicators

The primary User-Agent string is InternetNinja/1.0 (+https://internet-ninja.com/bot). A secondary mobile-mode agent, InternetNinja-Mobile/1.0, is used to fetch smartphone-optimized content. Behavioral fingerprints include a consistent Accept-Encoding: gzip, deflate, br header and a Connection: keep-alive header. The bot does not set a Referer header. Its TLS handshake reportedly uses the cipher suite TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384; because mainstream browsers also offer this suite, it is a weak signal on its own and is best combined with the other indicators. A list of current User-Agent strings is maintained at internet-ninja.com/user-agents.
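The header-based indicators above can be combined into a simple check. The sketch below is one possible approach, not an official detection method; the function name `matched_indicators` and the indicator labels are hypothetical. Because every one of these headers can be spoofed, a match is best confirmed against the published IP list.

```python
import re

# Matches both documented forms: "InternetNinja/1.0 ..." and "InternetNinja-Mobile/1.0".
UA_PATTERN = re.compile(r"^InternetNinja(?:-Mobile)?/\d+\.\d+")

EXPECTED_ENCODINGS = {"gzip", "deflate", "br"}


def matched_indicators(headers: dict) -> list:
    """Return the labels of the documented indicators that a request matches."""
    h = {name.lower(): value for name, value in headers.items()}
    found = []

    if UA_PATTERN.match(h.get("user-agent", "")):
        found.append("user-agent")

    if h.get("x-internet-ninja-crawl", "").strip().lower() == "true":
        found.append("crawl-header")

    encodings = {part.strip().lower() for part in h.get("accept-encoding", "").split(",")}
    if encodings == EXPECTED_ENCODINGS:
        found.append("accept-encoding")

    # The bot is described as never sending a Referer header.
    if "referer" not in h:
        found.append("no-referer")

    return found
```

Returning the list of matched labels, rather than a single yes/no answer, makes it easier to log which signals fired and to tune how many are required before a request is treated as coming from the bot.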

📊 Data Usage

Collected data is used for two purposes only: training large language models (specifically the NinjaML-4 series) and improving search-indexing algorithms within the company's internal analytics platform. The company publishes a quarterly Data Usage Report detailing the volume of text extracted, the domains crawled, and the model components improved. No personal or sensitive identifiers are intentionally stored; the bot automatically filters out pages containing PII patterns (e.g., email addresses, phone numbers) using regex-based pattern matching.
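To give a sense of what regex-based PII filtering can look like, here is a minimal sketch. The patterns and the `contains_pii` helper are illustrative assumptions, not the bot's actual filters; the phone pattern in particular only targets North American style numbers and would miss many other formats.

```python
import re

# Illustrative patterns only; real-world PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def contains_pii(text: str) -> bool:
    """Return True if the text contains something shaped like an email or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def filter_pages(pages: list) -> list:
    """Keep only the pages whose text shows no PII-shaped pattern."""
    return [page for page in pages if not contains_pii(page)]
```

Dropping a whole page on any match is a conservative choice: it trades lost content for a lower chance of retaining personal data.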

⚙️ Rate Limiting Policy

Because the Internet Ninja bot can generate sustained high-volume traffic during initial deep crawls, system administrators should apply rate-limiting thresholds (e.g., 100 requests per minute per IP) to prevent resource exhaustion. The company's own rate-limit guidelines recommend a stricter soft limit of 50 requests per minute, followed by a 60-second backoff if the limit is exceeded. According to the company, this policy lets the bot complete its data collection for AI training without overwhelming origin servers.
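One way to enforce such a policy on the server side is a per-IP sliding window with a backoff period. The sketch below is an assumption about how the recommendation could be applied, not the company's implementation; the class name `SoftRateLimiter` and its parameters are hypothetical, with defaults taken from the 50 requests per minute and 60-second backoff figures above.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SoftRateLimiter:
    """Per-IP sliding-window limiter with a fixed backoff after the limit is hit."""

    def __init__(self, limit: int = 50, window: float = 60.0, backoff: float = 60.0):
        self.limit = limit        # Max requests allowed inside one window.
        self.window = window      # Window length in seconds.
        self.backoff = backoff    # Block duration in seconds once the limit is exceeded.
        self._hits = defaultdict(deque)   # client IP -> timestamps of accepted requests
        self._blocked_until = {}          # client IP -> time the block ends

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it should get a 429."""
        if now is None:
            now = time.monotonic()

        if now < self._blocked_until.get(client_ip, 0.0):
            return False

        hits = self._hits[client_ip]
        # Discard timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()

        if len(hits) >= self.limit:
            self._blocked_until[client_ip] = now + self.backoff
            return False

        hits.append(now)
        return True
```

Passing `now` explicitly keeps the limiter easy to test; in production the default monotonic clock is used, and a rejected request would typically be answered with HTTP 429 and a Retry-After header.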

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.