Complete List of AI Crawlers (2025)
Keeping track of AI crawlers is like playing whack-a-mole. New ones pop up constantly, existing ones change their User-Agent strings, and some don't even identify themselves properly. This is my attempt at a definitive list, updated regularly as things change.
Last verified: December 2024.
The major players
These are the ones you'll see most often in your logs. They're from established AI companies and generally respect robots.txt (with some notable exceptions).
OpenAI
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| GPTBot | Training data collection | Yes |
| ChatGPT-User | Real-time browsing for ChatGPT Plus | Yes |
| OAI-SearchBot | SearchGPT web search | Yes |
OpenAI has been relatively transparent about their crawlers. GPTBot is the main one to block if you don't want your content in their training data. ChatGPT-User is different: it fetches pages when someone asks ChatGPT to browse a specific URL, so it behaves more like a regular user agent than a bulk crawler.
Some people block ChatGPT-User too, which prevents ChatGPT from reading your pages in real-time. Whether that's desirable depends on your use case. If you're a news site, you might actually want ChatGPT to be able to cite you.
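If you want that middle ground, here's a minimal robots.txt sketch that blocks training collection while leaving real-time browsing and search citation open:
# Block training, allow real-time browsing and search citation
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
(The Allow lines are technically redundant, since crawling is allowed by default, but they make your intent explicit.)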
Anthropic (Claude)
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| ClaudeBot | Training data collection | Yes |
| Claude-Web | Real-time browsing for Claude | Yes |
| Anthropic-ai | General Anthropic crawling | Yes |
Anthropic runs several crawlers. ClaudeBot is the primary one for training. They've been pretty good about respecting robots.txt in my experience.
Google
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Google-Extended | AI training (Gemini, etc.) | Yes |
| Googlebot | Regular search indexing | Yes |
This is important: Google-Extended is separate from Googlebot. Blocking Google-Extended does NOT affect your SEO. It only prevents Google from using your content for AI training (like Gemini). This distinction confuses a lot of people.
Googlebot is the regular search crawler. You almost certainly want to allow this unless you have a specific reason not to (like a staging site).
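In robots.txt terms, opting out of AI training while keeping search looks like this:
# Opt out of AI training (Gemini, etc.); no SEO impact
User-agent: Google-Extended
Disallow: /

# Regular search indexing continues as normal
User-agent: Googlebot
Allow: /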
Microsoft/Bing
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Bingbot | Regular search indexing | Yes |
Microsoft has been less transparent about their AI crawling. Copilot appears to use Bing's index rather than its own crawler, so blocking Bingbot would affect both search and AI. Most people don't block Bingbot for this reason.
Meta
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| FacebookBot | Link previews + AI training | Partially |
| Meta-ExternalAgent | AI training | Yes |
| Meta-ExternalFetcher | AI training | Yes |
Meta's crawling situation is messy. FacebookBot handles both link previews (when someone shares your URL on Facebook) and AI training. If you block it entirely, your links won't preview properly on Facebook or Instagram.
Meta-ExternalAgent and Meta-ExternalFetcher are specifically for AI, so those are safer to block without side effects.
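A sketch of that safer middle ground, keeping link previews working:
# Block Meta's AI-only crawlers
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Meta-ExternalFetcher
Disallow: /

# FacebookBot is deliberately not listed, so shared links still preview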
The aggressive ones
These crawlers are known for aggressive behavior or ignoring robots.txt. You'll want server-level blocking for these, not just robots.txt.
ByteDance (TikTok)
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| Bytespider | Training data collection | Often ignores |
Bytespider is notorious for ignoring robots.txt. I've seen it continue hammering sites well after being blocked in robots.txt. Use .htaccess or server-level rules.
Bytespider crawls for ByteDance's various AI products including TikTok's recommendation system. It's aggressive about crawling frequency and doesn't always honor rate limits.
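Here's a minimal .htaccess sketch, assuming Apache with mod_rewrite enabled (adjust the pattern if ByteDance changes its User-Agent string):
# Return 403 Forbidden to Bytespider, whatever robots.txt says
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]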
Common Crawl
| User-Agent | Purpose | Respects robots.txt |
|------------|---------|---------------------|
| CCBot | Open dataset for AI training | Yes (usually) |
CCBot isn't run by an AI company directly; it's operated by Common Crawl, a nonprofit that publishes open datasets used by many AI researchers. Your content can therefore end up in multiple AI models through a single Common Crawl dataset. They generally respect robots.txt, but they don't re-check it as often as you might like.
Search and research tools
These are AI-powered search products that crawl to answer user queries.
| User-Agent | Company | Purpose |
|------------|---------|---------|
| PerplexityBot | Perplexity AI | AI search engine |
| Diffbot | Diffbot | Knowledge graph + search |
| YouBot | You.com | AI search engine |
| Applebot | Apple | Siri + Search (potentially AI) |
PerplexityBot is increasingly common. Perplexity AI is an AI-powered search engine, and their bot crawls pretty actively. Whether you want to block it depends on whether you want to appear in Perplexity search results.
Applebot is a gray area. It's used for Siri and Apple's search features, which are increasingly AI-powered. Blocking Applebot itself might affect your visibility in Apple's ecosystem; if your concern is only AI training, Apple also honors a separate Applebot-Extended token in robots.txt for that specifically.
The scrapers
These are less about legitimate AI services and more about data aggregation and scraping.
| User-Agent | Company | Purpose |
|------------|---------|---------|
| Omgilibot | Omgili | Content aggregation |
| Amazonbot | Amazon | Alexa + product data |
| cohere-ai | Cohere | AI training |
| img2dataset | Open source | Image scraping for AI |
Amazonbot is another gray area. It handles both Alexa voice assistant data and potentially Amazon's AI training. If you're an e-commerce site, blocking Amazonbot might affect your product visibility.
The complete block list
If you want to block everything AI-related, here's a comprehensive robots.txt:
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Anthropic-ai
Disallow: /
# Google AI (NOT regular search)
User-agent: Google-Extended
Disallow: /
# Meta AI
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# Others
User-agent: Omgilibot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: img2dataset
Disallow: /
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Slurp
Allow: /
User-agent: DuckDuckBot
Allow: /
Use our robots.txt generator to create these rules automatically.
Finding bots in your logs
Not sure which bots are hitting your site? Here's how to check.
Apache access logs
# Split on double quotes so the full quoted User-Agent (field 6) survives
grep -i "bot\|crawler\|spider" /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20
Nginx access logs
grep -i "bot\|crawler\|spider" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -20
What to look for
Look for User-Agent strings containing the following (a targeted one-liner follows the list):
- "bot" (GPTBot, ClaudeBot, PerplexityBot)
- "spider" (Bytespider)
- "crawler" (various)
- Company names (Anthropic, OpenAI, Meta)
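If you'd rather search for the named AI crawlers directly, a one-liner like this works, assuming the default combined log format (swap in your Apache log path as needed):
grep -iE "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Amazonbot" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn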
Some scrapers use deceptive User-Agents that look like regular browsers. These are harder to catch without traffic analysis or a WAF.
Staying updated
This list will change. AI companies launch new crawlers, rename existing ones, or acquire other companies with their own bots. A few ways to stay current:
- Check this page — I update it when new crawlers are identified
- Monitor your logs — New User-Agents appearing is a red flag
- Follow security Twitter/X — The infosec community often catches new bots first
The legal landscape
As of late 2024, there's no settled law about AI training on web content. Various lawsuits are pending. robots.txt isn't legally binding, but it does demonstrate your intent.
Some sites are adding explicit terms to their Terms of Service prohibiting AI training. Whether this is enforceable remains to be seen, but it strengthens your legal position if you ever need to pursue action.
My recommendations
If I had to give general advice:
- Block the training bots — GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended (see the robots.txt sketch after this list)
- Be careful with link preview bots — blocking FacebookBot or Twitterbot breaks social sharing previews
- Keep regular search crawlers — allow Googlebot and Bingbot unless you have a specific reason not to
- Use server-level blocking for Bytespider — robots.txt isn't reliable for them
- Monitor your logs monthly — New bots appear regularly
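Translated into robots.txt, those recommendations boil down to roughly this (Bytespider included for completeness, though it really needs the server-level rule above):
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /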
Generate your blocking rules in seconds with our free tools.
For the complete implementation guide, see How to Block AI Crawlers.