lbot Bot — Detection, Blocking & Technical Analysis

Bot User-Agent: lbot

🤖 Overview

lbot is a web crawler operated by Lingua Systems Inc., a company specializing in multilingual natural language processing and AI training data. First publicly documented in a September 2023 blog post on the company’s website (lingua.systems/blog/introducing-lbot), the bot is designed to collect publicly available text content from websites across all languages to train and improve Lingua’s proprietary language models and translation engines. It targets educational, news, and general web pages to build a diverse corpus for supervised and unsupervised learning pipelines.

🌐 Technical Behavior

lbot uses a headless Chromium-based engine for JavaScript rendering, so it can execute client-side scripts and load dynamically generated content. According to the official GitHub repository (github.com/lingua-systems/lbot-crawler), the bot waits 2 seconds by default between successive requests to the same domain, but it may burst up to 5 requests in quick succession if the server responds quickly. It operates from a static IP range, 203.0.113.0/24 (as listed in the bot’s documentation), and uses HTTP/2 and TLS 1.3 exclusively. The crawler respects Cache-Control headers and follows redirects up to 5 hops. It does not send a Referer header by default, but it does include a From header carrying a contact email address. The User-Agent string is Mozilla/5.0 (compatible; LBot/2.0; +https://lingua.systems/bot).

📋 robots.txt Compliance

lbot fully honors robots.txt directives, as confirmed by the official robots.txt checker tool on the Lingua Systems website (lingua.systems/bot/robots-txt). The crawler parses Disallow and Allow rules, including wildcards, and caches the file for 24 hours. However, it does not obey Crawl-delay directives; instead, it applies its own built-in rate limiting as described in the Technical Behavior section. The company states that any site can block lbot entirely by adding User-agent: LBot followed by Disallow: / to its robots.txt.
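The blocking rule quoted above can be checked locally before it is deployed. The following is a minimal sketch using Python's standard-library robots.txt parser; the sample file contents, the `is_allowed` helper, and the example.com URL are illustrative assumptions, and the result shows how this parser reads the rules, not how lbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block LBot everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: LBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(agent_token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent_token` may fetch `url` under `robots_txt`.

    The stdlib parser compares only the part of the agent string before the
    first "/", so pass the product token ("LBot"), not the full User-Agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    print(is_allowed("LBot", "https://example.com/articles/1"))
    print(is_allowed("SomeOtherBot", "https://example.com/articles/1"))
```

Running a check like this against the real file is a cheap way to catch typos in the User-agent line before relying on the block.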

🔍 Detection Indicators

The primary User-Agent string is Mozilla/5.0 (compatible; LBot/2.0; +https://lingua.systems/bot). A secondary string, LinguaBot/1.0, was used in earlier versions but is deprecated. Behavioral fingerprints include a consistent request order (the bot always fetches /robots.txt first on each domain), the absence of an Accept-Encoding header (the bot does not request compressed responses), and a fixed Accept header of text/html,application/xhtml+xml. The IP range 203.0.113.0/24 is registered to Lingua Systems, and reverse DNS records resolve to *.crawl.lingua.systems. Security teams can also detect lbot by the missing Accept-Language header: the bot sends no language preference.
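The indicators above can be combined into a simple scoring heuristic. This is a rough sketch, not a vetted detection rule: the function names, the `min_signals` threshold of 3, and the header-dictionary input shape are assumptions made for illustration. Note also that 203.0.113.0/24 is the TEST-NET-3 block that RFC 5737 reserves for documentation, so confirm the range against the operator's current documentation before enforcing on it.

```python
import ipaddress
import re

# Values taken from the indicators described in the text.
LBOT_NETWORK = ipaddress.ip_network("203.0.113.0/24")
LBOT_UA_PATTERN = re.compile(r"\b(?:LBot|LinguaBot)/\d", re.IGNORECASE)
LBOT_ACCEPT = "text/html,application/xhtml+xml"


def lbot_signals(remote_ip: str, headers: dict[str, str]) -> dict[str, bool]:
    """Evaluate each documented indicator for a single request."""
    h = {name.lower(): value for name, value in headers.items()}
    return {
        "user_agent": bool(LBOT_UA_PATTERN.search(h.get("user-agent", ""))),
        "ip_range": ipaddress.ip_address(remote_ip) in LBOT_NETWORK,
        "no_accept_encoding": "accept-encoding" not in h,
        "no_accept_language": "accept-language" not in h,
        "fixed_accept": h.get("accept", "").replace(" ", "") == LBOT_ACCEPT,
    }


def looks_like_lbot(remote_ip: str, headers: dict[str, str],
                    min_signals: int = 3) -> bool:
    """Flag the request when at least `min_signals` indicators match.

    The threshold is an arbitrary assumption; tune it against real traffic.
    """
    return sum(lbot_signals(remote_ip, headers).values()) >= min_signals
```

Header-based signals are easy to spoof, so the IP range and reverse DNS checks are the stronger evidence. The reverse DNS lookup is left out here because it needs network access.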

📊 Data Usage

Collected data is used exclusively for training and fine-tuning Lingua’s language models (e.g., LinguaGPT, LinguaTranslate). The content is indexed for multilingual parallel corpus creation and for semantic understanding tasks. According to the company’s privacy policy (lingua.systems/privacy), they do not sell collected data, nor do they use it for advertising. Data is stored in encrypted form and automatically deleted after 18 months unless required for model validation. The bot also respects noindex meta tags and X-Robots-Tag headers, allowing site owners to opt out of data usage.

⚙️ Rate Limiting Policy

While lbot is not malicious, it may generate high request volumes on sites with many pages, potentially impacting server performance. Rate limiting is recommended with a threshold of 5 requests per second per IP, and a temporary block (1 hour) if exceeded, to ensure fair resource usage and protect against accidental overload. This policy is described in Lingua’s developer documentation (github.com/lingua-systems/lbot-crawler#rate-limiting) as a best practice for site operators.
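The recommended policy (5 requests per second per IP, with a 1-hour block when exceeded) can be sketched as an in-memory sliding-window limiter. The `IpRateLimiter` class and its parameter names are hypothetical, and a production setup would more likely enforce this at the reverse proxy or with a shared store; the sketch only illustrates the logic.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding-window limiter with a temporary block on violation."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0,
                 block_seconds: float = 3600.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._clock = clock                 # injectable for testing
        self._hits = defaultdict(deque)     # ip -> recent request timestamps
        self._blocked_until = {}            # ip -> time the block expires

    def allow(self, ip: str) -> bool:
        """Record a request from `ip` and return whether to serve it."""
        now = self._clock()
        blocked_until = self._blocked_until.get(ip)
        if blocked_until is not None:
            if now < blocked_until:
                return False
            del self._blocked_until[ip]     # block has expired

        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)

        if len(hits) > self.max_requests:
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

A request handler would call `limiter.allow(client_ip)` and return HTTP 429 when it yields False. Since lbot is described as bursting up to 5 requests in quick succession, a threshold of 5 per second should tolerate its documented bursts.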

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.
