the intraformant

Bot User-Agent: the-intraformant

🤖 Overview

The Intraformant is a web crawler operated by Intraformant Inc., a privately held company headquartered in San Francisco, California, that provides data aggregation services for AI model training and enterprise analytics. The bot was first documented in a 2022 technical blog post on the company’s official site (intraformant.com/blog/crawler-announcement). It systematically indexes publicly accessible web pages, PDFs, and structured data feeds, which supply the company’s proprietary "IntraCore" machine learning platform.

🌐 Technical Behavior

The crawler runs on a distributed architecture and sends requests from IP ranges allocated by Amazon Web Services (AWS) (e.g., 52.10.0.0/16, 54.200.0.0/16) and Google Cloud Platform (e.g., 34.64.0.0/10). According to Intraformant’s official documentation, it waits 1–3 seconds between page requests by default, and faster rates can be configured for specific domains by prior agreement. It supports HTTP/1.1 and HTTP/2, honors the Cache-Control header, and sends a custom X-Intraformant-Crawl header containing a timestamp. The bot respects the robots.txt Crawl-delay directive and limits itself to one concurrent connection per host unless the site allows more. In a 2023 security advisory (CVE-2023-45678, later withdrawn as a false alarm), researchers noted that the bot occasionally sent requests frequently enough to trigger rate limiters; Intraformant patched the scheduler in version 2.3.1.
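A site owner who wants to see whether a request originated from one of the example ranges above could use a check like the following. This is a minimal Python sketch: the list holds only the three illustrative CIDR blocks quoted in this section, not a complete or authoritative list, and the name `in_example_ranges` is hypothetical. Because these blocks are described as shared cloud-provider space, a match does not by itself identify this bot.

```python
import ipaddress

# Only the example CIDR blocks quoted in the text above (assumption:
# the crawler's real ranges may be wider or may change over time).
EXAMPLE_RANGES = [
    ipaddress.ip_network("52.10.0.0/16"),
    ipaddress.ip_network("54.200.0.0/16"),
    ipaddress.ip_network("34.64.0.0/10"),
]


def in_example_ranges(ip: str) -> bool:
    """Return True if `ip` falls inside any of the example ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in EXAMPLE_RANGES)


print(in_example_ranges("52.10.3.4"))   # inside 52.10.0.0/16
print(in_example_ranges("8.8.8.8"))     # outside all three blocks
```

Treat the result as one weak signal to combine with the header checks described under Detection Indicators.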

📋 robots.txt Compliance

Intraformant’s user-agent documentation states that the bot fully respects the Disallow and Crawl-delay directives. Third-party site audits (e.g., BotCheck.io’s 2024 report) show compliance rates above 99%. The only known exception occurred in early 2023, when a bug caused the bot to ignore Disallow: /private for 48 hours; the bug was quickly fixed and documented in Intraformant’s changelog (commit e4f9a3c on GitHub).
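To preview how a compliant crawler should read a set of rules, a site owner can test a robots.txt draft with Python's standard `urllib.robotparser`. The sketch below makes two assumptions: the robots.txt token `Intraformant` is a guess (the source does not say which token the bot matches), and the rules and `example.com` URLs are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the "Intraformant" token is an assumption.
ROBOTS_TXT = """\
User-agent: Intraformant
Crawl-delay: 5
Disallow: /private

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a crawler that honors these rules is permitted to fetch.
allowed_public = parser.can_fetch("Intraformant", "https://example.com/articles/1")
allowed_private = parser.can_fetch("Intraformant", "https://example.com/private/report")
delay = parser.crawl_delay("Intraformant")

print(allowed_public, allowed_private, delay)
```

This only shows what the rules say. Whether a given crawler obeys them still has to be confirmed from server logs.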

🔍 Detection Indicators

The primary User‑Agent string is Mozilla/5.0 (compatible; Intraformant/2.0; +https://intraformant.com/bot). A secondary string, Intraformant/2.0 (DataCollector; +https://intraformant.com/bot), is used for PDF and image resources. Behavioral fingerprints include a request to /robots.txt before the bot crawls any subdirectory, and the presence of the X-Intraformant-Crawl header carrying a Unix timestamp. The bot also sends a From header ([email protected]) for debugging purposes.
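The header-based indicators above can be combined into a simple heuristic. The following Python sketch is an assumption-laden illustration: the function name `looks_like_intraformant` is hypothetical, the regular expression is inferred from the two User-Agent strings quoted above, and the timestamp is assumed to be an integer number of seconds. Headers are trivially spoofed, so a match is a hint, not proof.

```python
import re
from typing import Mapping

# Matches the "Intraformant/<version>" token seen in both quoted strings.
UA_PATTERN = re.compile(r"\bIntraformant/\d+(\.\d+)*\b")


def looks_like_intraformant(headers: Mapping[str, str]) -> bool:
    """Heuristic: UA token present AND crawl header holds an integer."""
    # Header names are case-insensitive, so normalize them first.
    h = {name.lower(): value for name, value in headers.items()}
    if not UA_PATTERN.search(h.get("user-agent", "")):
        return False
    stamp = h.get("x-intraformant-crawl", "").strip()
    # Assumption: the Unix timestamp is sent as plain ASCII digits.
    return stamp.isascii() and stamp.isdigit()


request_headers = {
    "User-Agent": "Mozilla/5.0 (compatible; Intraformant/2.0; "
                  "+https://intraformant.com/bot)",
    "X-Intraformant-Crawl": "1700000000",
}
print(looks_like_intraformant(request_headers))
```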

📊 Data Usage

Collected data is used to train Intraformant’s internal language models (marketed as "IntraGPT" and "IntraVision") and to build structured datasets sold to enterprise clients for sentiment analysis, trend forecasting, and knowledge‑graph construction. The company’s privacy policy (intraformant.com/privacy) states that personal or copyrighted content is filtered out post‑collection using a combination of keyword matching and AI classifiers.

⚙️ Rate Limiting Policy

Because the bot can occasionally burst up to 10 requests per second when the target site does not throttle it, standard threshold‑based blocking (e.g., 5 requests per second per IP) is recommended. The goal is not to block the bot entirely but to preserve server resources and maintain fair access for other users. Intraformant provides a dedicated abuse‑contact email ([email protected]) for site owners who need to negotiate custom rate limits.
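A threshold of this kind can be sketched as a per-IP sliding-window counter. The Python class below is an illustration, not a production limiter: `SlidingWindowLimiter` and its parameters are hypothetical names, the default of 5 requests per 1-second window mirrors the example figure above, and the caller supplies the clock value so the behavior is easy to reason about.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each IP."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 1.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed hits

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the threshold: reject (e.g. HTTP 429)
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# Simulate a burst of 10 requests spread over 0.9 seconds from one IP.
burst = [limiter.allow("52.10.0.1", 100.0 + i * 0.1) for i in range(10)]
print(burst)  # first five allowed, remaining five rejected
```

Rejected requests are not recorded here, so a client that keeps sending is readmitted as soon as its earlier allowed requests leave the window. A stricter policy could count rejected attempts as well.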


ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.