litlrbot
Bot User-Agent: litlrbot
🤖 Overview
litlrbot is a web crawler operated by LitLR Systems, a privately held AI research company founded in 2019. According to the official bot policy page at https://litlr.com/bot, its primary purpose is to collect publicly accessible text and structured data from the web to train and improve LitLR’s proprietary large language models and downstream NLP products. The crawler was first observed in server logs in early 2022 and has since become a consistent presence on a wide range of content sites.
🌐 Technical Behavior
litlrbot performs breadth‑first crawls with a maximum request rate of approximately 1 request per 5 seconds per domain, as documented in the company’s technical overview published in its GitHub repository at https://github.com/litlr/crawler. It uses HTTP/1.1 with keep‑alive connections and supports both IPv4 and IPv6; its IP ranges are announced via AS209437 (LitLR Systems) and parts of AS396982. The bot typically fetches robots.txt first, then HTML pages, and then downloads CSS, JavaScript, and image resources only if the page references them. It does not follow redirect chains beyond three hops and respects Retry-After headers when it receives a 429 or 503 status code. Crawl depth is limited to five levels from the seed URL, and the bot does not follow in‑page links marked rel="nofollow".
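The traversal rules above (breadth-first order, a five-level depth limit, and skipping rel="nofollow" links) can be illustrated with a minimal sketch. This is not LitLR's code: the function name crawl_order and the in-memory links mapping are hypothetical stand-ins for real page fetching, and rate limiting and redirect handling are left out.

```python
from collections import deque

MAX_DEPTH = 5  # crawl depth limit stated in the description above


def crawl_order(links, seed, max_depth=MAX_DEPTH):
    """Return URLs in breadth-first visit order, skipping rel="nofollow" links.

    `links` is a hypothetical in-memory stand-in for fetched pages: it maps
    a URL to the list of (href, rel) pairs found on that page.
    """
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for href, rel in links.get(url, []):
            if "nofollow" in rel.split():
                continue  # the link is marked rel="nofollow"
            if href not in seen:
                seen.add(href)
                queue.append((href, depth + 1))
    return order


# Tiny made-up site used only for illustration.
site = {
    "/": [("/a", ""), ("/b", "nofollow"), ("/c", "")],
    "/a": [("/a/1", "")],
    "/c": [("/a", ""), ("/c/1", "nofollow noopener")],
}
print(crawl_order(site, "/"))  # ['/', '/a', '/c', '/a/1']
```

Using a queue rather than recursion is what makes the order breadth-first: every page at one depth is visited before any page at the next.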
📋 robots.txt Compliance
LitLR Systems states in its official documentation that litlrbot strictly honours Disallow directives in robots.txt, including those that use wildcard patterns and those that cover paths requiring authentication or containing sensitive information. Analysis by webmasters, reported on the LitLR blog at https://litlr.com/blog/litlrbot‑robots, indicates that the bot has not been observed crawling disallowed paths and that it also respects Crawl‑Delay directives.
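A site owner who wants to rely on this behaviour could test a rule set before deploying it. The sketch below uses Python's standard urllib.robotparser to check what a compliant crawler identifying as litlrbot would be allowed to fetch. The robots.txt content, the /private/ path, the 10-second delay, and the example.com URLs are illustrative assumptions, not values taken from LitLR's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep litlrbot out of /private/ and ask it to wait
# 10 seconds between requests; every other crawler is left unrestricted.
ROBOTS_TXT = """\
User-agent: litlrbot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a compliant crawler using the "litlrbot" token may fetch.
print(parser.can_fetch("litlrbot", "https://example.com/private/report.html"))
print(parser.can_fetch("litlrbot", "https://example.com/blog/post.html"))
print(parser.crawl_delay("litlrbot"))
```

The standard-library parser matches rules by path prefix, so this sketch deliberately avoids wildcard patterns. Checking wildcard rules would need a parser that implements them.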
🔍 Detection Indicators
The primary User‑Agent string is Mozilla/5.0 (compatible; litlrbot/1.0; +https://litlr.com/bot), with secondary fallback strings such as litlrbot/1.0 and LitLR‑Crawler/1.0. Identifying HTTP headers include a custom X‑LitLR‑Request‑ID header, a From header set to [email protected], and a User‑Agent that always contains the lowercase string litlrbot. Behaviourally, the bot sets the Referer header to the URL being crawled and never sends cookies.
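The indicators above can be combined into a simple request check. The following is a minimal sketch, not an official detection rule: the function names and the "at least three signals" threshold are assumptions made for illustration, header names are written with ordinary ASCII hyphens, and the From header is left out because its address is not readable in this listing. Header values are easy to spoof, so a match is only a hint.

```python
def litlrbot_signals(headers, url):
    """Report which of the listed indicators a request shows.

    `headers` is a plain dict of request headers; `url` is the requested URL.
    """
    h = {name.lower(): value for name, value in headers.items()}
    return {
        "ua_token": "litlrbot" in h.get("user-agent", ""),
        "request_id": "x-litlr-request-id" in h,
        "self_referer": h.get("referer") == url,
        "no_cookies": "cookie" not in h,
    }


def looks_like_litlrbot(headers, url):
    """Hypothetical rule: the User-Agent token plus at least two other signals."""
    signals = litlrbot_signals(headers, url)
    return signals["ua_token"] and sum(signals.values()) >= 3


# Made-up request used only for illustration.
url = "https://example.com/page"
sample = {
    "User-Agent": "Mozilla/5.0 (compatible; litlrbot/1.0; +https://litlr.com/bot)",
    "X-LitLR-Request-ID": "abc123",
    "Referer": url,
}
print(looks_like_litlrbot(sample, url))  # True
```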
📊 Data Usage
Data collected by litlrbot is used exclusively for training LitLR’s language models, which are released as open‑source checkpoints under the LitLR‑1 and LitLR‑2 series on Hugging Face (as noted in https://huggingface.co/litlr). The company also uses the scraped content to build domain‑specific fine‑tuning datasets for tasks such as summarization, question answering, and sentiment analysis.
⚙️ Rate Limiting Policy
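To keep litlrbot from overwhelming origin servers, the recommended rate limit is 20 requests per 60 seconds per IP; once that threshold is exceeded, the server should return a 429 response with a Retry‑After of 30 seconds. This policy is consistent with the bot’s own documentation: the documented rate of about 1 request per 5 seconds per domain works out to roughly 12 requests per 60 seconds, which is below the threshold. It also ensures fair resource sharing with human visitors and other crawlers.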
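One way to apply the recommended threshold is a per-IP sliding window. The sketch below is an assumption about how such a policy could be enforced, not a description of any particular product: the class name SlidingWindowLimiter and its check method are hypothetical, and timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from collections import defaultdict, deque

LIMIT = 20        # requests allowed per window, per IP
WINDOW = 60.0     # window length in seconds
RETRY_AFTER = 30  # seconds to advertise in the Retry-After header


class SlidingWindowLimiter:
    """Track recent request times per IP and reject requests over the limit."""

    def __init__(self, limit=LIMIT, window=WINDOW):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def check(self, ip, now):
        """Return (status_code, extra_headers) for a request from `ip` at time `now`."""
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return 429, {"Retry-After": str(RETRY_AFTER)}
        recent.append(now)
        return 200, {}


limiter = SlidingWindowLimiter()
# Illustrative burst: one request per second from a documentation-range IP.
statuses = [limiter.check("203.0.113.7", float(t))[0] for t in range(21)]
print(statuses.count(200), statuses.count(429))  # 20 1
```

Rejected requests are not added to the window here, so a client that backs off for the advertised interval is not penalised further for the requests that were refused.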
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.