Getintent
Bot User-Agent: getintent
🤖 Overview
Getintent is a legitimate web crawler operated by GetIntent Inc., a digital advertising and analytics company founded in 2015 and headquartered in New York, USA. Its primary purpose is to collect publicly available web content, metadata, and user behavior signals to feed into the company’s programmatic advertising platform and audience targeting engine. The bot forms the data foundation for Getintent’s real-time bidding (RTB) and contextual ad placement services, as documented on their official website (getintent.com/about) and their data privacy policy (getintent.com/privacy).
🌐 Technical Behavior
The Getintent crawler operates over HTTP/HTTPS. Its request rate can peak at several requests per second per IP, although it typically waits a randomized 1-5 seconds between requests to avoid overloading servers. Based on public server logs and security research (e.g., from Graylog and Imperva), the bot uses IP addresses from a dynamic pool primarily allocated through Amazon Web Services (AWS) and DigitalOcean, reported within broad blocks such as 3.0.0.0/8, 13.0.0.0/8, and 54.0.0.0/8. It follows a breadth-first crawl strategy, targeting blog posts, news articles, e‑commerce product pages, and forums to extract keywords, sentiment, and product categories. The crawler requests robots.txt before crawling and adheres to the Crawl-Delay directive if specified.
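As a rough illustration, a log-analysis script could check whether a client address falls inside the blocks named above. This is only a sketch: the function name in_listed_ranges is hypothetical, and because each /8 block covers millions of unrelated hosts, a match is a weak hint that should be combined with other signals rather than used as a basis for blocking.

```python
import ipaddress

# Blocks quoted in the text above. They are very broad, so a match
# says little on its own about which client is behind the address.
LISTED_RANGES = [
    ipaddress.ip_network("3.0.0.0/8"),
    ipaddress.ip_network("13.0.0.0/8"),
    ipaddress.ip_network("54.0.0.0/8"),
]


def in_listed_ranges(ip_text: str) -> bool:
    """Return True if ip_text parses as an IP inside one of the listed blocks."""
    try:
        address = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not a valid IP address string.
        return False
    return any(address in network for network in LISTED_RANGES)


if __name__ == "__main__":
    print(in_listed_ranges("54.12.3.4"))    # True
    print(in_listed_ranges("203.0.113.7"))  # False
```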
📋 robots.txt Compliance
According to official documentation in Getintent’s crawler policy (getintent.com/crawler) and verified by public robots.txt logs from major publishers, the bot honors Disallow directives and also respects the Crawl-Delay field. Evidence from the Common Crawl project and Moz’s crawler list confirms that Getintent stops crawling any URL or directory explicitly disallowed in robots.txt, and it does not attempt to circumvent restrictions.
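Site owners can sanity-check their robots.txt rules locally before relying on a crawler to honor them. The sketch below uses Python's standard urllib.robotparser module. The rules, the example.com URLs, and the assumption that the crawler matches the product token "Getintent" are illustrative and not taken from Getintent's documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: Getintent
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pass the bare product token; robotparser compares only the part of the
# agent string before the first "/", so a full "Mozilla/5.0 ..." string
# would not match the "Getintent" group.
blocked = parser.can_fetch("Getintent", "https://example.com/private/page")
allowed = parser.can_fetch("Getintent", "https://example.com/blog/post")
delay = parser.crawl_delay("Getintent")

print(blocked)  # False
print(allowed)  # True
print(delay)    # 5
```

This only shows how a standards-following parser would read the rules; whether a given crawler actually behaves that way still has to be confirmed from server logs.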
🔍 Detection Indicators
The primary User-Agent string is “Mozilla/5.0 (compatible; Getintent/1.0; +http://getintent.com/crawler)”. Additionally, older versions used “GetintentBot/1.0”. Behavioral fingerprints include a high rate of requests to JSON-LD and Open Graph metadata endpoints (e.g., /page/graph), and the bot always sends an Accept: text/html,application/xhtml+xml header with a default Accept-Language of en-US. The IP reverse DNS often resolves to *.awsglobalaccelerator.com or *.digitaloceanspaces.com.
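A minimal sketch of matching the two User-Agent forms quoted above is shown here. The function name is_getintent and the regular expression are assumptions made for illustration. User-Agent headers are trivially spoofed, so a match identifies a claimed identity, not a verified one.

```python
import re

# Matches "Getintent/<digit>" and the older "GetintentBot/<digit>",
# case-insensitively, anywhere in the header value.
GETINTENT_UA = re.compile(r"\bgetintent(?:bot)?/\d", re.IGNORECASE)


def is_getintent(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be the Getintent crawler."""
    return GETINTENT_UA.search(user_agent) is not None


if __name__ == "__main__":
    current = "Mozilla/5.0 (compatible; Getintent/1.0; +http://getintent.com/crawler)"
    legacy = "GetintentBot/1.0"
    browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36"
    print(is_getintent(current))  # True
    print(is_getintent(legacy))   # True
    print(is_getintent(browser))  # False
```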
📊 Data Usage
Collected data – including page titles, descriptions, product prices, user reviews, and content categories – is aggregated into Getintent’s Audience Graph, a proprietary knowledge base that powers their contextual targeting and lookalike audience modeling for programmatic ad campaigns. The data is also used to train machine learning models for ad relevance prediction and real-time bidding optimization.
⚙️ Rate Limiting Policy
Despite its legitimate purpose, the Getintent crawler is rate-limited on many production web applications because its aggressive defaults (up to 5 requests per second per IP) can consume significant server resources, especially on small sites. Threshold-based blocking protects site performance while still allowing the bot to complete its indexing within reasonable limits; a typical rule triggers once the same IP range exceeds 100 requests per minute.
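One way to express such a threshold is a sliding-window counter keyed by IP range. The sketch below is a simplified, in-memory illustration: the class name SlidingWindowLimiter, the key format, and the choice to pass timestamps in explicitly are all assumptions, and a production setup would more likely rely on the web server, CDN, or WAF.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key."""

    def __init__(self, limit: int = 100, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of allowed requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=100, window_seconds=60.0)
    key = "203.0.113.0/24"  # hypothetical IP range used as the bucket key

    # Simulate 5 requests per second: 100 requests in the first 20 seconds.
    first_hundred = [limiter.allow(key, i * 0.2) for i in range(100)]
    print(all(first_hundred))        # True
    print(limiter.allow(key, 20.0))  # False, the threshold has been reached
    print(limiter.allow(key, 60.0))  # True, the oldest request has expired
```

Passing the timestamp as an argument keeps the example deterministic; a real handler would supply time.monotonic() instead.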
ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information is the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.