Complete List of AI Crawlers (2025)
Keeping track of AI crawlers is like playing whack-a-mole. New ones pop up constantly, existing ones change their User-Agent strings, and some don't even identify themselves properly. This is my attempt at a definitive list, updated regularly as things change.
Last verified: December 2024.
The major players
These are the ones you'll see most often in your logs. They're from established AI companies and generally respect robots.txt (with some notable exceptions).
OpenAI
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| GPTBot | Training data collection | Yes |
| ChatGPT-User | Real-time browsing for ChatGPT Plus | Yes |
| OAI-SearchBot | SearchGPT web search | Yes |
OpenAI has been relatively transparent about their crawlers. GPTBot is the main one to block if you don't want your content in their training data. ChatGPT-User is different: it fetches pages when someone asks ChatGPT to browse a specific URL, so it behaves more like a regular user agent than a bulk crawler.
Some people block ChatGPT-User too, which prevents ChatGPT from reading your pages in real-time. Whether that's desirable depends on your use case. If you're a news site, you might actually want ChatGPT to be able to cite you.
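If you want that middle ground, here's a minimal robots.txt sketch that blocks training collection while leaving real-time browsing and search citation open:
# Block training, allow real-time browsing and search citation
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
(The Allow lines are technically redundant, since crawling is allowed by default, but they make your intent explicit.)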
Anthropic (Claude)
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| ClaudeBot | Training data collection | Yes |
| Claude-Web | Real-time browsing for Claude | Yes |
| Anthropic-ai | General Anthropic crawling | Yes |
Anthropic runs several crawlers. ClaudeBot is the primary one for training. They've been pretty good about respecting robots.txt in my experience.
Google
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Google-Extended | AI training (Gemini, etc.) | Yes |
| Googlebot | Regular search indexing | Yes |
This is important: Google-Extended is separate from Googlebot. Blocking Google-Extended does NOT affect your SEO. It only prevents Google from using your content for AI training (like Gemini). This distinction confuses a lot of people.
Googlebot is the regular search crawler. You almost certainly want to allow this unless you have a specific reason not to (like a staging site).
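In robots.txt terms, opting out of AI training while keeping search looks like this:
# Opt out of AI training (Gemini, etc.); no SEO impact
User-agent: Google-Extended
Disallow: /

# Regular search indexing continues as normal
User-agent: Googlebot
Allow: /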
Microsoft/Bing
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Bingbot | Regular search indexing | Yes |
Microsoft has been less transparent about their AI crawling. Copilot appears to use Bing's index rather than its own crawler, so blocking Bingbot would affect both search and AI. Most people don't block Bingbot for this reason.
Meta
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| FacebookBot | Link previews + AI training | Partially |
| Meta-ExternalAgent | AI training | Yes |
| Meta-ExternalFetcher | AI training | Yes |
Meta's crawling situation is messy. FacebookBot handles both link previews (when someone shares your URL on Facebook) and AI training. If you block it entirely, your links won't preview properly on Facebook or Instagram.
Meta-ExternalAgent and Meta-ExternalFetcher are specifically for AI, so those are safer to block without side effects.
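A sketch of that safer middle ground, keeping link previews working:
# Block Meta's AI-only crawlers
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

# FacebookBot is deliberately not listed, so shared links still preview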
The aggressive ones
These crawlers are known for aggressive behavior or ignoring robots.txt. You'll want server-level blocking for these, not just robots.txt.
ByteDance (TikTok)
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Bytespider | Training data collection | Often ignores |
Bytespider is notorious for ignoring robots.txt. I've seen it continue hammering sites well after being blocked in robots.txt. Use .htaccess or server-level rules.
Bytespider crawls for ByteDance's various AI products including TikTok's recommendation system. It's aggressive about crawling frequency and doesn't always honor rate limits.
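Here's a minimal .htaccess sketch, assuming Apache with mod_rewrite enabled (adjust the pattern if ByteDance changes its User-Agent string):
# Return 403 Forbidden to Bytespider, whatever robots.txt says
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]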
Common Crawl
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| CCBot | Open dataset for AI training | Yes (usually) |
CCBot isn't run by an AI company directly; it's operated by Common Crawl, a nonprofit that publishes open datasets used by many AI researchers. Your content can therefore end up in multiple AI models through a single Common Crawl dataset. They generally respect robots.txt, but they don't re-check it as often as you might like.
Search and research tools
These are AI-powered search products that crawl to answer user queries.
| User-Agent | Company | Purpose |
|------------|---------|---------|
| PerplexityBot | Perplexity AI | AI search engine |
| Diffbot | Diffbot | Knowledge graph + search |
| YouBot | You.com | AI search engine |
| Applebot | Apple | Siri + Search (potentially AI) |
PerplexityBot is increasingly common. Perplexity AI is an AI-powered search engine, and their bot crawls pretty actively. Whether you want to block it depends on whether you want to appear in Perplexity search results.
Applebot is a gray area. It's used for Siri and Apple's search features, which are increasingly AI-powered. Blocking Applebot itself might affect your visibility in Apple's ecosystem; if your concern is only AI training, Apple also honors a separate Applebot-Extended token in robots.txt for that specifically.
The scrapers
These are less about legitimate AI services and more about data aggregation and scraping.
| User-Agent | Company | Purpose |
|------------|---------|---------|
| Omgilibot | Omgili | Content aggregation |
| Amazonbot | Amazon | Alexa + product data |
| cohere-ai | Cohere | AI training |
| img2dataset | Open source | Image scraping for AI |
Amazonbot is another gray area. It handles both Alexa voice assistant data and potentially Amazon's AI training. If you're an e-commerce site, blocking Amazonbot might affect your product visibility.
The complete block list
If you want to block everything AI-related, here's a comprehensive robots.txt:
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Anthropic-ai
Disallow: /
# Google AI (NOT regular search)
User-agent: Google-Extended
Disallow: /
# Meta AI
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# Others
User-agent: Omgilibot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: img2dataset
Disallow: /
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Slurp
Allow: /
User-agent: DuckDuckBot
Allow: /
Use our robots.txt generator to create these rules automatically.
Finding bots in your logs
Not sure which bots are hitting your site? Here's how to check.
Apache access logs
# Split on double quotes so the full quoted User-Agent (field 6) survives
grep -i "bot\|crawler\|spider" /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20
Nginx access logs
grep -i "bot\|crawler\|spider" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20
What to look for
Look for User-Agent strings containing the following (a targeted one-liner follows the list):
- "bot" (GPTBot, ClaudeBot, PerplexityBot)
- "spider" (Bytespider)
- "crawler" (various)
- Company names (Anthropic, OpenAI, Meta)
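If you'd rather search for the named AI crawlers directly, a one-liner like this works, assuming the default combined log format (swap in your Apache log path as needed):
grep -iE "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Amazonbot" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn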
Some scrapers use deceptive User-Agents that look like regular browsers. These are harder to catch without traffic analysis or a WAF.
Staying updated
This list will change. AI companies launch new crawlers, rename existing ones, or acquire other companies with their own bots. A few ways to stay current:
- Check this page — I update it when new crawlers are identified
- Monitor your logs — New User-Agents appearing is a red flag
- Follow security Twitter/X — The infosec community often catches new bots first
The legal landscape
As of late 2024, there's no settled law about AI training on web content. Various lawsuits are pending. robots.txt isn't legally binding, but it does demonstrate your intent.
Some sites are adding explicit terms to their Terms of Service prohibiting AI training. Whether this is enforceable remains to be seen, but it strengthens your legal position if you ever need to pursue action.
My recommendations
If I had to give general advice:
- Block the training bots — GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended (see the robots.txt sketch after this list)
- Be careful with link preview bots — blocking FacebookBot or Twitterbot breaks social sharing previews
- Keep regular search crawlers — allow Googlebot and Bingbot unless you have a specific reason not to
- Use server-level blocking for Bytespider — robots.txt isn't reliable for them
- Monitor your logs monthly — New bots appear regularly
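Translated into robots.txt, those recommendations boil down to roughly this (Bytespider included for completeness, though it really needs the server-level rule above):
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /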
Generate your blocking rules in seconds with our free tools.
For the complete implementation guide, see How to Block AI Crawlers.