
Complete List of AI Crawlers (2025)


Keeping track of AI crawlers is like playing whack-a-mole. New ones pop up constantly, existing ones change their User-Agent strings, and some don't even identify themselves properly. This is my attempt at a definitive list, updated regularly as things change.

Last verified: December 2024.

The major players

These are the ones you'll see most often in your logs. They're from established AI companies and generally respect robots.txt (with some notable exceptions).

OpenAI

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| GPTBot | Training data collection | Yes |
| ChatGPT-User | Real-time browsing for ChatGPT Plus | Yes |
| OAI-SearchBot | SearchGPT web search | Yes |

OpenAI has been relatively transparent about their crawlers. GPTBot is the main one you'll want to block if you don't want your content in their training data. ChatGPT-User is different—it's used when someone asks ChatGPT to browse a specific URL, more like a regular user agent.

Some people block ChatGPT-User too, which prevents ChatGPT from reading your pages in real-time. Whether that's desirable depends on your use case. If you're a news site, you might actually want ChatGPT to be able to cite you.
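For example, a site that wants to opt out of training but stay citable in live ChatGPT sessions could block only GPTBot, leaving ChatGPT-User unmentioned:

```
# Opt out of training data collection
User-agent: GPTBot
Disallow: /

# No rule for ChatGPT-User, so real-time browsing and citation still work
```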

Anthropic (Claude)

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| ClaudeBot | Training data collection | Yes |
| Claude-Web | Real-time browsing for Claude | Yes |
| Anthropic-ai | General Anthropic crawling | Yes |

Anthropic runs several crawlers. ClaudeBot is the primary one for training. They've been pretty good about respecting robots.txt in my experience.

Google

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Google-Extended | AI training (Gemini, etc.) | Yes |
| Googlebot | Regular search indexing | Yes |

This is important: Google-Extended is separate from Googlebot. Blocking Google-Extended does NOT affect your SEO. It only prevents Google from using your content for AI training (like Gemini). This distinction confuses a lot of people.

Googlebot is the regular search crawler. You almost certainly want to allow this unless you have a specific reason not to (like a staging site).
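To make the distinction concrete, this is the minimal fragment that opts out of AI training without touching search. The explicit Googlebot rule is optional; it just documents the intent:

```
# Opt out of AI training (Gemini, etc.)
User-agent: Google-Extended
Disallow: /

# Regular search indexing is unaffected
User-agent: Googlebot
Allow: /
```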

Microsoft/Bing

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Bingbot | Regular search indexing | Yes |

Microsoft has been less transparent about their AI crawling. Copilot appears to use Bing's index rather than its own crawler, so blocking Bingbot would affect both search and AI. Most people don't block Bingbot for this reason.

Meta

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| FacebookBot | Link previews + AI training | Partially |
| Meta-ExternalAgent | AI training | Yes |
| Meta-ExternalFetcher | AI training | Yes |

Meta's crawling situation is messy. FacebookBot handles both link previews (when someone shares your URL on Facebook) and AI training. If you block it entirely, your links won't preview properly on Facebook or Instagram.

Meta-ExternalAgent and Meta-ExternalFetcher are specifically for AI, so those are safer to block without side effects.

The aggressive ones

These crawlers are known for aggressive behavior or ignoring robots.txt. You'll want server-level blocking for these, not just robots.txt.

ByteDance (TikTok)

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| Bytespider | Training data collection | Often ignores |

Bytespider is notorious for ignoring robots.txt. I've seen it continue hammering sites well after being blocked in robots.txt. Use .htaccess or server-level rules.

Bytespider crawls for ByteDance's various AI products including TikTok's recommendation system. It's aggressive about crawling frequency and doesn't always honor rate limits.
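As a sketch of server-level blocking, an Nginx `if` on the User-Agent header (the `~*` operator matches case-insensitively) returns 403 before the request ever reaches your application; the Apache `.htaccess` equivalent uses `RewriteCond %{HTTP_USER_AGENT}` with `mod_rewrite`:

```nginx
# Nginx: refuse Bytespider regardless of what robots.txt says
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```

Place this inside the relevant `server` block. The same pattern extends to any other bot that ignores robots.txt.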

Common Crawl

| User-Agent | Purpose | Respects robots.txt |
| --- | --- | --- |
| CCBot | Open dataset for AI training | Yes (usually) |

CCBot isn't run by an AI company directly—it's a nonprofit that creates open datasets used by many AI researchers. Your content could end up in multiple AI models through Common Crawl's dataset. They generally respect robots.txt, but they also don't check it as frequently as you might like.

Search and research tools

These are AI-powered search products that crawl to answer user queries.

| User-Agent | Company | Purpose |
| --- | --- | --- |
| PerplexityBot | Perplexity AI | AI search engine |
| Diffbot | Diffbot | Knowledge graph + search |
| YouBot | You.com | AI search engine |
| Applebot | Apple | Siri + Search (potentially AI) |

PerplexityBot is increasingly common. Perplexity AI is an AI-powered search engine, and their bot crawls pretty actively. Whether you want to block it depends on whether you want to appear in Perplexity search results.

Applebot is a gray area. It's used for Siri and Apple's search features, which are increasingly AI-powered. Blocking it might affect your visibility in Apple's ecosystem.

The scrapers

These are less about legitimate AI services and more about data aggregation and scraping.

| User-Agent | Company | Purpose |
| --- | --- | --- |
| Omgilibot | Omgili | Content aggregation |
| Amazonbot | Amazon | Alexa + product data |
| cohere-ai | Cohere | AI training |
| img2dataset | Open source | Image scraping for AI |

Amazonbot is another gray area. It handles both Alexa voice assistant data and potentially Amazon's AI training. If you're an e-commerce site, blocking Amazonbot might affect your product visibility.

The complete block list

If you want to block everything AI-related, here's a comprehensive robots.txt:

# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Anthropic-ai
Disallow: /

# Google AI (NOT regular search)
User-agent: Google-Extended
Disallow: /

# Meta AI
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Others
User-agent: Omgilibot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: YouBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: img2dataset
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: Slurp
Allow: /

User-agent: DuckDuckBot
Allow: /

Want to skip the copy-paste? Use our robots.txt generator to create these rules automatically.

Finding bots in your logs

Not sure which bots are hitting your site? Here's how to check.

Apache access logs

grep -i "bot\|crawler\|spider" /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20

Nginx access logs

grep -i "bot\|crawler\|spider" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20

What to look for

Look for User-Agent strings containing:

  • "bot" (GPTBot, ClaudeBot, PerplexityBot)
  • "spider" (Bytespider)
  • "crawler" (various)
  • Company names (Anthropic, OpenAI, Meta)
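
Matching the declared names mechanically is straightforward. Here's a small Python sketch; the pattern list is an illustrative subset of the crawlers above, not an exhaustive one:

```python
import re

# Illustrative subset of the AI crawler names covered in this post
AI_BOT_PATTERNS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "anthropic-ai",
    "Google-Extended", "Bytespider", "CCBot",
    "PerplexityBot", "Meta-ExternalAgent",
]
AI_BOT_RE = re.compile(
    "|".join(re.escape(p) for p in AI_BOT_PATTERNS), re.IGNORECASE
)

def classify_user_agent(ua):
    """Return the matched AI crawler name, or None for other traffic."""
    m = AI_BOT_RE.search(ua)
    return m.group(0) if m else None

print(classify_user_agent(
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
))  # GPTBot
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # None
```

Run it over the User-Agent column from the log commands above to get a quick per-bot breakdown.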

Some scrapers use deceptive User-Agents that look like regular browsers. These are harder to catch without traffic analysis or a WAF.

Staying updated

This list will change. AI companies launch new crawlers, rename existing ones, or acquire other companies with their own bots. A few ways to stay current:

  1. Check this page — I update it when new crawlers are identified
  2. Monitor your logs — New User-Agents appearing is a red flag
  3. Follow security Twitter/X — The infosec community often catches new bots first

The legal landscape

As of late 2024, there's no settled law about AI training on web content. Various lawsuits are pending. robots.txt isn't legally binding, but it does demonstrate your intent.

Some sites are adding explicit terms to their Terms of Service prohibiting AI training. Whether this is enforceable remains to be seen, but it strengthens your legal position if you ever need to pursue action.

My recommendations

If I had to give general advice:

  1. Block the training bots — GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended
  2. Be careful with link preview bots — FacebookBot, Twitterbot affect social sharing
  3. Keep regular search crawlers — Googlebot, Bingbot unless you have a reason not to
  4. Use server-level blocking for Bytespider — robots.txt isn't reliable for them
  5. Monitor your logs monthly — New bots appear regularly

For the complete implementation guide, see How to Block AI Crawlers.


Ready to block AI crawlers?

Use our free generators to create your blocking rules in seconds.