lmqueuebot
Bot User-Agent: lmqueuebot
🤖 Overview
lmqueuebot is a legitimate web crawler operated by the LM Queue open-source project, hosted on GitHub under the repository lmqueue/lmqueuebot. Its primary purpose is to gather publicly accessible web content, including text, code, and documentation, to build high-quality training datasets for large language models (LLMs). The bot is explicitly designed to support the development of open-source AI models by providing diverse and representative data sources. According to the project's official README, the crawler is intended for non-commercial research and is widely used in academic and community-driven AI training initiatives.
🌐 Technical Behavior
lmqueuebot issues asynchronous HTTP requests at a default crawl rate of 1 request per second per domain, as documented in the project’s configuration files. It supports both HTTP/1.1 and HTTP/2, respects standard robots.txt directives, and honors Crawl-Delay when it is present. The bot typically crawls from IP addresses belonging to major cloud providers such as AWS, Google Cloud, and DigitalOcean, with dynamically assigned address ranges. It does not use headless browsers; instead, it performs simple GET requests with a configurable concurrency limit. A queue-based architecture spreads requests across multiple domains so that no single domain is overloaded. The bot follows links recursively but limits depth to 3 levels by default.
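The per-domain pacing described above can be illustrated with a minimal sketch. This is not the project's code: the function name schedule, the crawl_delays mapping, and the rule of taking the larger of the default delay and a site's Crawl-Delay are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# 1 request per second per domain, as stated in the description above.
DEFAULT_DELAY_S = 1.0


def schedule(urls, crawl_delays=None, start=0.0):
    """Return sorted (fetch_time, url) pairs with per-domain spacing.

    crawl_delays is a hypothetical mapping of host -> Crawl-Delay seconds.
    Assumption: the effective delay is the larger of the default and the
    site's Crawl-Delay.
    """
    crawl_delays = crawl_delays or {}
    next_slot = defaultdict(lambda: start)  # host -> earliest allowed time
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        delay = max(DEFAULT_DELAY_S, crawl_delays.get(host, 0.0))
        fetch_time = next_slot[host]
        plan.append((fetch_time, url))
        next_slot[host] = fetch_time + delay
    return sorted(plan)


if __name__ == "__main__":
    for when, url in schedule([
        "https://a.example/1",
        "https://a.example/2",
        "https://b.example/1",
    ]):
        print(f"t={when:.1f}s  {url}")
```

Requests to different domains share the same time slot, while a second request to the same domain is pushed back by the delay, which is the effect a per-domain queue is meant to achieve.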
📋 robots.txt Compliance
Based on the project’s documentation and community reports, lmqueuebot fully honors Disallow rules in robots.txt. The bot reads robots.txt before crawling each domain and caches the results for 24 hours. Evidence from the GitHub issues page confirms that the developers have implemented explicit checks for robots.txt compliance, including handling of wildcard patterns. No known violations have been reported in security advisories.
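A rough sketch of the check-then-cache behavior described above, using Python's standard urllib.robotparser. The class name RobotsCache and the fetch_text callable are hypothetical, and the standard-library parser only does prefix matching on paths, so it does not reproduce the wildcard handling attributed to the bot.

```python
import time
from urllib.robotparser import RobotFileParser

CACHE_TTL_S = 24 * 60 * 60  # results are cached for 24 hours, per the text above


class RobotsCache:
    """Caches one parsed robots.txt per host (illustrative, not the bot's code)."""

    def __init__(self, fetch_text, clock=time.time):
        # fetch_text is a hypothetical callable: host -> robots.txt body (str).
        self._fetch_text = fetch_text
        self._clock = clock
        self._entries = {}  # host -> (fetched_at, parser)

    def allowed(self, host, path, agent="lmqueuebot"):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is None or now - entry[0] >= CACHE_TTL_S:
            # Cache miss or stale entry: fetch and parse robots.txt again.
            parser = RobotFileParser()
            parser.parse(self._fetch_text(host).splitlines())
            entry = (now, parser)
            self._entries[host] = entry
        return entry[1].can_fetch(agent, path)


if __name__ == "__main__":
    rules = "User-agent: lmqueuebot\nDisallow: /private/\n"
    cache = RobotsCache(lambda host: rules)
    print(cache.allowed("example.com", "/private/report.html"))
    print(cache.allowed("example.com", "/docs/index.html"))
```

Site operators can use the same parser to confirm that a Disallow rule they have written actually applies to the lmqueuebot agent name.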
🔍 Detection Indicators
The primary identifier is the User-Agent string: lmqueuebot/1.0 (or lmqueuebot/2.0 for newer versions). Additionally, the bot sets an X-LMQueue-Request header with a unique request ID. Behavioral fingerprints include a consistent 1-second delay between requests, and the bot does not execute JavaScript. It typically requests text/html content only.
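As an illustration, the header-based indicators above could be checked with a small helper like the following. The function name classify_request and the returned dictionary shape are assumptions; only the User-Agent token and the X-LMQueue-Request header name come from the description. A User-Agent string can be spoofed, so a match is a hint rather than proof.

```python
import re

# Matches "lmqueuebot/1.0", "lmqueuebot/2.0", and similar version suffixes.
UA_PATTERN = re.compile(r"\blmqueuebot/(\d+(?:\.\d+)*)", re.IGNORECASE)


def classify_request(headers):
    """Report which lmqueuebot indicators are present in a header mapping."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    match = UA_PATTERN.search(normalized.get("user-agent", ""))
    return {
        "ua_match": match is not None,
        "version": match.group(1) if match else None,
        "has_request_id": bool(normalized.get("x-lmqueue-request")),
    }


if __name__ == "__main__":
    print(classify_request({
        "User-Agent": "lmqueuebot/2.0",
        "X-LMQueue-Request": "req-123",  # made-up request ID for the demo
    }))
```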
📊 Data Usage
Collected data is used exclusively for AI training and research purposes, specifically to build curated datasets for fine-tuning and evaluating language models. The project’s license states that crawled content may be released as open-source datasets under permissive licenses such as CC-BY. The bot does not store personally identifiable information (PII); it filters out any content containing email addresses or phone numbers.
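The filtering step is not specified in detail, so the following is only a sketch of what dropping content with email addresses or phone numbers might look like. The regular expressions and the function names contains_pii and filter_documents are assumptions; the phone pattern covers only 10-digit North American style numbers and real filters are usually broader.

```python
import re

# Simplified patterns chosen for illustration, not taken from the project.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def contains_pii(text):
    """Return True if the text contains an email address or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def filter_documents(docs):
    """Keep only documents with no detected email address or phone number."""
    return [doc for doc in docs if not contains_pii(doc)]


if __name__ == "__main__":
    pages = [
        "Install the package and run the tests.",
        "Questions? Write to jane@example.com.",
        "Support line: 555-123-4567.",
    ]
    print(filter_documents(pages))
```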
⚙️ Rate Limiting Policy
lmqueuebot is rate-limited because its crawling can generate significant traffic over extended periods, potentially impacting server performance. The policy recommends a threshold-based block at 100 requests per minute per IP, with a soft ban for exceeding 500 requests in 5 minutes. This ensures fair usage without permanently blocking the bot, which remains a legitimate agent.
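The two thresholds above can be expressed as a sliding-window check per IP address. This is a minimal in-memory sketch: the class name RateLimitPolicy, the returned labels, and the choice to count a request before evaluating it are assumptions, and a production setup would normally enforce this in a proxy or firewall instead.

```python
from collections import defaultdict, deque

BLOCK_LIMIT, BLOCK_WINDOW_S = 100, 60         # 100 requests per minute per IP
SOFT_BAN_LIMIT, SOFT_BAN_WINDOW_S = 500, 300  # 500 requests in 5 minutes


class RateLimitPolicy:
    """Tracks request timestamps per IP and applies both thresholds."""

    def __init__(self):
        self._hits = defaultdict(deque)  # ip -> timestamps within 5 minutes

    def check(self, ip, now):
        """Record a request at time `now` (seconds) and return the verdict."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have left the longer (5-minute) window.
        while hits and now - hits[0] >= SOFT_BAN_WINDOW_S:
            hits.popleft()
        if len(hits) > SOFT_BAN_LIMIT:
            return "soft_ban"
        last_minute = sum(1 for t in hits if now - t < BLOCK_WINDOW_S)
        if last_minute > BLOCK_LIMIT:
            return "block"
        return "allow"


if __name__ == "__main__":
    policy = RateLimitPolicy()
    # 203.0.113.7 is a documentation-range address used as a placeholder.
    verdicts = [policy.check("203.0.113.7", i * 0.5) for i in range(101)]
    print(verdicts[99], verdicts[100])
```

A crawler that keeps to the documented 1 request per second per domain stays well under both limits, so this policy only affects bursts or misconfigured instances.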
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.