proxy crawler

Crawler User-Agent: proxy-crawler

🤖 Overview

The Proxy Crawler (user‑agent ProxyCrawler/1.0) is operated by ProxyCrawl, a commercial web scraping API service founded in 2016 with over 500 active customers. Its purpose is to crawl web pages on behalf of clients who use the ProxyCrawl platform for legitimate data extraction tasks such as price monitoring, lead generation, market research, and content aggregation. The collected data is delivered to the client’s application through the ProxyCrawl API in structured formats like JSON.

🌐 Technical Behavior

ProxyCrawler employs a distributed network of rotating residential and datacenter IP addresses to avoid detection and rate‑limiting. The crawl frequency is configurable per client plan, with default rates ranging from 1 request per second for basic plans up to 20 requests per second for enterprise plans. It supports both HTTP and HTTPS protocols and can parse JavaScript‑rendered content when the service’s headless browser mode is enabled. The bot respects the Crawl‑Delay directive in robots.txt when present. ProxyCrawl publishes its IP ranges regularly on its official documentation page at https://proxycrawl.com/docs/ip‑ranges, listing ranges such as 45.77.0.0/16 and 104.18.0.0/16.
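As a rough illustration of how the published ranges could be used, the sketch below checks whether a client IP falls inside the two example ranges quoted above. The function name `in_listed_ranges` is hypothetical, and the range list is only the sample from this page; the operator's published list should be treated as the source of truth.

```python
import ipaddress

# Example ranges quoted in the text above. This is an assumption for
# illustration only; the operator's published list may differ or change.
EXAMPLE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("45.77.0.0/16", "104.18.0.0/16")
]


def in_listed_ranges(ip, networks=EXAMPLE_RANGES):
    """Return True if `ip` parses as an address inside any listed network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP literal, so it cannot match any range.
        return False
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for candidate in ("45.77.12.34", "8.8.8.8"):
        print(candidate, in_listed_ranges(candidate))
```

An IP match alone is weak evidence, since broad ranges like these may also be shared with unrelated traffic; it is probably best combined with the other indicators described on this page.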

📋 robots.txt Compliance

According to ProxyCrawl’s official documentation, ProxyCrawler explicitly honors Disallow directives in robots.txt. The company provides a dedicated robots.txt compliance page and advises site operators to add a User‑agent: ProxyCrawler group to their robots.txt file to control access. If no Disallow rule applies to the bot, it will crawl the entire site. The bot also respects the Allow directive.
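A site operator who wants to sanity-check such rules before deploying them could use Python's standard `urllib.robotparser`, as in the sketch below. The robots.txt content, the `/private/` path, the 5-second delay, and the `example.com` URLs are all made-up values for illustration, and the standard-library parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks ProxyCrawler from /private/, asks it to
# wait 5 seconds between requests, and leaves other agents unrestricted.
ROBOTS_TXT = """\
User-agent: ProxyCrawler
Crawl-delay: 5
Disallow: /private/
Allow: /

User-agent: *
Disallow:
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network I/O."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return load_rules(robots_txt).can_fetch(user_agent, url)


if __name__ == "__main__":
    ua = "ProxyCrawler/1.0"
    print(is_allowed(ROBOTS_TXT, ua, "https://example.com/private/report"))
    print(is_allowed(ROBOTS_TXT, ua, "https://example.com/products"))
    print(load_rules(ROBOTS_TXT).crawl_delay(ua))
```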

🔍 Detection Indicators

The primary detection indicator is the User‑Agent string ProxyCrawler/1.0. It may also include a custom X‑ProxyCrawl‑ID header containing the client’s unique identifier. Behavioral fingerprints include a consistent request rate, the use of a rotating IP pool, and a standard HTTP request pattern without browser‑specific headers like Accept‑Language or Referer. The bot does not spoof other User‑Agents, making it straightforward to identify in server logs.
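The header-based indicators above can be combined into a simple check, sketched below. The function name `detection_signals` and the signal labels are hypothetical; the header names come from the description above, and the behavioral indicators (request rate, IP rotation) are not covered because they need traffic history rather than a single request.

```python
def detection_signals(headers):
    """Return the list of header-based indicators present in one request.

    `headers` is a mapping of header name to value; names are compared
    case-insensitively, as HTTP header names are case-insensitive.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []

    # Indicator 1: the self-identifying User-Agent string.
    if "proxycrawler" in normalized.get("user-agent", "").lower():
        signals.append("user-agent")

    # Indicator 2: the optional client-identifier header.
    if "x-proxycrawl-id" in normalized:
        signals.append("x-proxycrawl-id")

    # Indicator 3: neither of the browser-typical headers is present.
    if "accept-language" not in normalized and "referer" not in normalized:
        signals.append("no-browser-headers")

    return signals


if __name__ == "__main__":
    request_headers = {
        "User-Agent": "ProxyCrawler/1.0",
        "X-ProxyCrawl-ID": "client-123",  # made-up identifier
    }
    print(detection_signals(request_headers))
```

The third signal on its own is noisy, because many non-browser clients also omit those headers, so it is probably better used as supporting evidence than as a blocking rule.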

📊 Data Usage

Collected data is used exclusively by ProxyCrawl’s clients for their own business purposes, such as competitive analysis, catalog aggregation, and lead generation. ProxyCrawl does not itself train AI models or build a centralised repository of crawled data. The company’s terms of service explicitly prohibit using the service for illegal activities, for scraping sites that block the bot, or for reselling data without permission.

⚙️ Rate Limiting Policy

Rate limiting is recommended for Proxy Crawler because its distributed proxy network can generate request volumes high enough to degrade server performance. A threshold‑based blocking policy works as follows: site operators set a per‑IP request cap (e.g., 100 requests per minute) and block IPs that exceed it, while legitimate human traffic, which stays well below the cap, is unaffected. This policy preserves server resources while keeping the site accessible for the bot’s legitimate use cases.
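One way to implement a per-IP cap like this is a sliding-window counter, sketched below in memory for a single process. The class name `PerIpRateLimiter` and its parameters are hypothetical; the 100-requests-per-minute default mirrors the example figure above, and a production deployment would more likely rely on the web server's or CDN's built-in rate limiting.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most `max_requests` per `window_seconds` per IP."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each IP to the timestamps of its recently accepted requests.
        self._hits = defaultdict(deque)

    def allow(self, ip, now=None):
        """Record a request from `ip`; return False if it exceeds the cap."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]

        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

        if len(hits) >= self.max_requests:
            return False  # over the cap: caller should reject or block

        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIpRateLimiter(max_requests=3, window_seconds=60)
    for second in range(5):
        print(second, limiter.allow("203.0.113.5", now=second))
```

The `now` parameter exists only to make the limiter easy to test with fixed timestamps; in normal use it can be omitted so the monotonic clock is used.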

ⓘ Data Notice: The information presented above has been compiled from publicly available internet sources. Boteraser aggregates this data solely for informational purposes and does not independently classify, evaluate, or endorse any findings about the bots listed. The accuracy and completeness of this information are the sole responsibility of the original publishers. Boteraser and its operators accept no liability for any decisions made based on this data.