How to Block CCBot (Common Crawl)
CCBot is different from the other crawlers on this list. It's not run by an AI company—it's run by Common Crawl, a nonprofit that creates open datasets used by researchers and AI companies alike.
The twist? Your content could end up in multiple AI systems through a single crawler.
What is Common Crawl?
Common Crawl is a nonprofit that maintains a massive open archive of web content. They've been doing this since 2008, way before the current AI boom. Their dataset includes:
- Billions of web pages
- Regular snapshots of the web
- Text, metadata, and structure
AI researchers love it because it's free, comprehensive, and available for training. GPT-3 used Common Crawl data. So did many other models.
The problem: blocking GPTBot doesn't stop OpenAI from using Common Crawl's archived copy of your content. You need to block CCBot too.
The quick block
robots.txt:
User-agent: CCBot
Disallow: /
Use our robots.txt generator to create these rules automatically.
Does CCBot respect robots.txt?
Generally, yes. Common Crawl says they honor robots.txt directives. In my experience, they do. But there's a catch.
CCBot checks robots.txt periodically, not continuously. If your content was crawled before you added the block, it's in their archive. And they release new crawl dumps regularly, so there can be a lag between adding a block and it taking effect.
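If you want a rough sense of that lag, check your access logs for CCBot's most recent robots.txt fetches. A quick sketch, assuming a standard Apache access log at /var/log/apache2/access.log (adjust the path for your server):
# Show the last few times CCBot requested robots.txt
grep -i "CCBot" /var/log/apache2/access.log | grep "robots.txt" | tail -n 5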
Full robots.txt example
# Block AI training crawlers
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Server-level blocking
For faster enforcement, add server-level rules.
Apache (.htaccess)
<IfModule mod_rewrite.c>
RewriteEngine On
# Return 403 Forbidden for any request whose User-Agent contains "CCBot"
RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule .* - [F,L]
</IfModule>
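If you're blocking several AI crawlers at the server level anyway, a single alternation pattern keeps the rules compact. A sketch under the same mod_rewrite setup; the bot list here is illustrative, not exhaustive:
<IfModule mod_rewrite.c>
RewriteEngine On
# Match any of the listed crawler user agents, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (CCBot|GPTBot|ClaudeBot) [NC]
RewriteRule .* - [F,L]
</IfModule>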
Use our .htaccess generator to create these rules automatically.
Nginx
# Inside your server {} block: reject requests identifying as CCBot
if ($http_user_agent ~* "CCBot") {
    return 403;
}
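To cover more than one crawler, nginx's map directive is more idiomatic than stacking if blocks. A sketch, assuming the map lives in your http {} context and the check sits in the relevant server {} block (again, the bot list is illustrative):
# http {} context: flag known AI crawler user agents
map $http_user_agent $block_ai_crawler {
    default 0;
    ~*CCBot 1;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
}

# server {} block: reject flagged requests
if ($block_ai_crawler) {
    return 403;
}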
The archive problem
Here's the uncomfortable truth: Common Crawl has been archiving the web for over 15 years. If your site has been online for any length of time, there's a good chance it's already in their archives.
Blocking CCBot now prevents future crawls. It doesn't remove your existing content from their datasets. Those archives are out there, being used for training.
Common Crawl has no process for requesting removal of specific content from past archives. The data is published; it's done.
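What you can do is check whether your pages show up in a recent crawl. Common Crawl publishes a CDX index API at index.commoncrawl.org; here's a sketch, with the crawl ID below only as an example (pick a current one from the list on that site):
# Query the Common Crawl index for captures of your domain
# Replace CC-MAIN-2024-33 with a current crawl ID from index.commoncrawl.org
curl -s "https://index.commoncrawl.org/CC-MAIN-2024-33-index?url=yoursite.com/*&output=json" | head -n 5
Each line returned describes a capture of one of your URLs in that crawl.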
Who uses Common Crawl data?
A partial list:
- OpenAI (GPT models)
- Google (research, possibly Gemini)
- Meta (LLaMA models)
- Stability AI (Stable Diffusion text encoders)
- Various academic researchers
When you block CCBot, you're potentially affecting data for all of these. It's a pretty high-leverage block.
Should you block CCBot?
Arguments for blocking:
- One crawler, many downstream uses
- Prevents future training data collection
- Common Crawl data is freely available, so anyone can use it
Arguments against:
- Your content might already be archived
- Common Crawl supports legitimate research, not just commercial AI
- The "open web" argument (debatable)
I lean toward blocking if you're serious about protecting content. The horse may have left the barn, but you can close the door on future horses.
Checking for CCBot in your logs
grep -i "CCBot" /var/log/apache2/access.log | wc -l
CCBot crawls less frequently than GPTBot or ClaudeBot. You might see it weekly or monthly rather than daily. But when it hits, it crawls thoroughly.
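If it does show up, it's worth seeing what it's fetching. A quick sketch, assuming the common/combined log format, where the request path is the seventh field:
# Top 10 paths requested by CCBot
grep -i "CCBot" /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -n 10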
Verification
curl -A "CCBot/2.0" -I https://yoursite.com/
Should return 403 Forbidden if your server block is active.
For robots.txt verification, check that the rule is properly formatted and accessible at your domain root.
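A quick way to check both, assuming robots.txt sits at the standard location:
# Confirm the file is reachable and the CCBot rule (plus its Disallow line) is present
curl -s https://yoursite.com/robots.txt | grep -i -A 1 "CCBot"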
A note on ai.txt
Some sites are experimenting with an ai.txt file as a way to communicate AI training preferences. This is not a standard and not widely supported. Stick with robots.txt and server-level blocking for actual enforcement. ai.txt is more of a policy statement than a technical control.
My take
CCBot is worth blocking if you're blocking AI crawlers generally. Because Common Crawl data is used by so many downstream consumers, it's actually one of the more impactful crawlers to block.
That said, be realistic about the limitations. Past crawls are past crawls. You're protecting future content, not retroactively removing existing data.
Generate your blocking rules in seconds with our free tools.
See also: