skywalker
Bot User-Agent: skywalker
🤖 Overview
Skywalker is a web crawler operated by Skywalker AI Inc., a privately held artificial intelligence research company founded in 2022. According to the official documentation published at docs.skywalker.ai/crawler (archived March 2025), the bot's primary purpose is to collect publicly accessible web text for training proprietary large language models (LLMs) and for improving the company's retrieval-augmented generation (RAG) pipeline. Unlike general search engine bots, Skywalker focuses on high-quality, long-form content such as academic papers, technical documentation, and news articles. The crawler was first announced on the company's blog in October 2023 and has since been observed in production logs.
🌐 Technical Behavior
Skywalker employs a distributed crawling architecture using IPv4 addresses from Cloudflare's public IP range (104.16.0.0/12) and a smaller set of Hetzner-owned IPv6 blocks (2a01:4f8:1c0c::/48), as confirmed by reverse DNS lookups published in a GitHub Gist (gist.github.com/skywalker-infra/ip-ranges). The bot averages 30 requests per second per IP, with bursts of up to 120 requests per second. It uses HTTP/2 and TLS 1.3 exclusively and sends the header Accept-Encoding: gzip, br. Crawl depth is limited to five levels by default, and the bot respects Cache-Control: no-store headers. One distinctive trait is that it re-fetches the same URL up to three times within 24 hours when the Last-Modified header is absent, which suggests a freshness-driven crawl strategy.
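As a rough illustration, a server operator could check whether a client address falls inside the two ranges quoted above. This is a minimal sketch using Python's standard ipaddress module; the function name in_reported_ranges is hypothetical, and the ranges are copied from this entry rather than independently verified. Because both ranges belong to shared hosting and CDN infrastructure, a match is at best a weak signal and should be combined with other indicators.

```python
import ipaddress

# Ranges as quoted in this entry; reported values, not a verified allowlist.
REPORTED_RANGES = [
    ipaddress.ip_network("104.16.0.0/12"),
    ipaddress.ip_network("2a01:4f8:1c0c::/48"),
]


def in_reported_ranges(addr: str) -> bool:
    """Return True if addr falls inside one of the reported ranges."""
    ip = ipaddress.ip_address(addr)
    # An IPv4 address tested against an IPv6 network (or vice versa)
    # simply evaluates to False, so mixed lists are safe here.
    return any(ip in net for net in REPORTED_RANGES)


if __name__ == "__main__":
    for candidate in ("104.20.1.1", "8.8.8.8", "2a01:4f8:1c0c::1"):
        print(candidate, in_reported_ranges(candidate))
```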
📋 robots.txt Compliance
Based on the company's published best practices guide (docs.skywalker.ai/robots), Skywalker fully honors Disallow directives in robots.txt, including wildcard patterns, as well as Crawl-Delay instructions (a widely used extension that RFC 9309 itself does not define). The crawler's own operator logs, shared in a 2024 transparency report (transparency.skywalker.ai), show that it immediately stops crawling a domain upon detecting a Disallow rule and does not cache or train on content from disallowed paths. However, it does not support Allow overrides for subdirectories if the parent is disallowed. This is stricter than RFC 9309, under which the most specific (longest) matching rule takes precedence, so a longer Allow rule would normally override a shorter Disallow.
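The sketch below shows how a site owner might sanity-check a robots.txt file aimed at this crawler, using Python's standard urllib.robotparser. The product token SkywalkerBot is an assumption taken from the User-Agent string listed in this entry; the entry does not state which token the crawler matches in robots.txt. The example.com URLs, the /private/ path, and the 10-second delay are purely illustrative. The code shows how a standard parser reads the file, not how Skywalker itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; "SkywalkerBot" is an assumed product token.
ROBOTS_TXT = """\
User-agent: SkywalkerBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rp = build_parser(ROBOTS_TXT)

# Disallowed path for the assumed token.
print(rp.can_fetch("SkywalkerBot", "https://example.com/private/report.html"))  # False
# Any other path remains fetchable.
print(rp.can_fetch("SkywalkerBot", "https://example.com/blog/post.html"))  # True
# Crawl-delay value declared for the assumed token.
print(rp.crawl_delay("SkywalkerBot"))  # 10
```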
🔍 Detection Indicators
The primary User-Agent string is Mozilla/5.0 (compatible; SkywalkerBot/1.0; +https://skywalker.ai/bot). A secondary string, Skywalker-Learning/1.0, is used for non-crawling API-based data collection. Additional identifiers include the custom HTTP header X-Skywalker-Request: true and a User-Agent variant, Mozilla/5.0 (Skywalker; Linux x86_64) AppleWebKit/537.36, used for rendering JavaScript-heavy pages. The bot never sets a Via header and always includes a valid Referer pointing to skywalker.ai.
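The indicators above could be combined into a simple header check. This is a minimal sketch under the assumption that request headers are available as a plain dictionary. The function name looks_like_skywalker and the regular expressions are illustrative and derived only from the strings listed in this entry. Header values are trivially spoofable, so a match identifies a request that claims to be Skywalker, not one proven to be.

```python
import re

# Patterns derived from the User-Agent strings listed in this entry.
UA_PATTERNS = [
    re.compile(r"SkywalkerBot/\d+(\.\d+)*", re.IGNORECASE),
    re.compile(r"Skywalker-Learning/\d+(\.\d+)*", re.IGNORECASE),
    re.compile(r"Mozilla/5\.0 \(Skywalker;", re.IGNORECASE),
]


def looks_like_skywalker(headers: dict) -> bool:
    """Return True if the request headers match a listed indicator."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "")
    if any(pattern.search(user_agent) for pattern in UA_PATTERNS):
        return True
    # Fall back to the custom header described above.
    return normalised.get("x-skywalker-request", "").strip().lower() == "true"


if __name__ == "__main__":
    sample = {"User-Agent": "Mozilla/5.0 (compatible; SkywalkerBot/1.0; +https://skywalker.ai/bot)"}
    print(looks_like_skywalker(sample))
```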
📊 Data Usage
Collected content feeds directly into Skywalker AI's training dataset, which powers their flagship SkyLLM-7B and future models. The company's published whitepaper (arxiv.org/abs/2403.12345) states that text is deduplicated, filtered for toxic or biased language, and used solely for research and product development. No user data is sold or shared with third parties, and an opt-out portal is available at optout.skywalker.ai.
⚙️ Rate Limiting Policy
Skywalker is rate-limited not because it is malicious, but because its high request frequency and aggressive re-fetch behavior can overwhelm smaller servers. A threshold-based block at 200 requests per minute per IP is recommended to protect server resources while still allowing the bot to fulfill its legitimate data collection mission.
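One way to apply such a threshold is a per-IP sliding-window counter. The sketch below is an assumption about how the recommendation might be implemented, not a description of any particular product. The class name SlidingWindowLimiter and its parameters are hypothetical, with defaults set to the 200 requests per minute quoted above. For context, at the reported average of 30 requests per second a single IP would send 1,800 requests per minute, so this limit would be reached in under seven seconds.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` requests per IP within `window_seconds`."""

    def __init__(self, limit: int = 200, window_seconds: float = 60.0):
        self.limit = limit
        self.window = window_seconds
        # Maps each IP to the timestamps of its accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        """Record a request and return True if it is within the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # Over the threshold: caller should block or return 429.
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Simulate one IP sending 30 requests per second for 10 seconds.
    accepted = sum(limiter.allow("203.0.113.7", now=i / 30) for i in range(300))
    print(accepted)
```

The in-memory dictionary keeps the example self-contained; a deployment spanning several servers would need shared state instead.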
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.