
How to Block CCBot (Common Crawl)

November 25, 2024 (Updated: Dec 5, 2024) · 4 min read

CCBot is different from the other crawlers we've covered. It's not run by an AI company—it's run by Common Crawl, a nonprofit that creates open datasets used by researchers and AI companies alike.

The twist? Your content could end up in multiple AI systems through a single crawler.

What is Common Crawl?

Common Crawl is a nonprofit that maintains a massive open archive of web content. They've been doing this since 2008, way before the current AI boom. Their dataset includes:

  • Billions of web pages
  • Regular snapshots of the web
  • Text, metadata, and structure

AI researchers love it because it's free, comprehensive, and available for training. GPT-3 used Common Crawl data. So did many other models.

The problem: blocking GPTBot doesn't stop OpenAI from using Common Crawl's archived copy of your content. You need to block CCBot too.

The quick block

robots.txt:

User-agent: CCBot
Disallow: /

Want to skip the copy-paste?

Use our robots.txt generator to create these rules automatically.

Try robots.txt Generator

Does CCBot respect robots.txt?

Generally, yes. Common Crawl says they honor robots.txt directives. In my experience, they do. But there's a catch.

CCBot checks robots.txt periodically, not continuously. If your content was crawled before you added the block, it's in their archive. And they release new crawl dumps regularly, so there can be a lag between adding a block and it taking effect.
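
If you want to check whether CCBot has re-fetched your robots.txt since you added the rule, your access log will show it. A quick look, assuming the default Apache log path (adjust for your setup):

grep -i "CCBot" /var/log/apache2/access.log | grep "robots.txt" | tail -n 3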

Full robots.txt example

# Block AI training crawlers
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Server-level blocking

For faster enforcement, add server-level rules.

Apache (.htaccess)

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} CCBot [NC]
RewriteRule .* - [F,L]
</IfModule>
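
Before relying on the rule, it's worth a quick sanity check that it blocks CCBot without catching regular visitors. Assuming yoursite.com is your domain, the first request below should return 200 and the second 403:

curl -A "Mozilla/5.0" -s -o /dev/null -w "%{http_code}\n" https://yoursite.com/
curl -A "CCBot/2.0" -s -o /dev/null -w "%{http_code}\n" https://yoursite.com/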

Want to skip the copy-paste?

Use our .htaccess generator to create these rules automatically.

Try .htaccess Generator

Nginx

if ($http_user_agent ~* "CCBot") {
    return 403;
}
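
The if block goes inside your server {} context. After adding it, validate the config and reload. A typical sequence, assuming a systemd-based setup:

sudo nginx -t && sudo systemctl reload nginx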

The archive problem

Here's the uncomfortable truth: Common Crawl has been archiving the web for over 15 years. If your site has been online for any length of time, there's a good chance it's already in their archives.

Blocking CCBot now prevents future crawls. It doesn't remove your existing content from their datasets. Those archives are out there, being used for training.

Common Crawl offers no simple way to pull specific content out of crawls that have already been published. Once a dataset is released and downloaded, it's out of your hands.

Who uses Common Crawl data?

A partial list:

  • OpenAI (GPT models)
  • Google (research, possibly Gemini)
  • Meta (LLaMA models)
  • Stability AI (via the LAION datasets, which were built from Common Crawl)
  • Various academic researchers

When you block CCBot, you're potentially affecting data for all of these. It's a pretty high-leverage block.

Should you block CCBot?

Arguments for blocking:

  • One crawler, many downstream uses
  • Prevents future training data collection
  • Common Crawl data is freely available, so once your content is in a crawl, anyone can use it

Arguments against:

  • Your content might already be archived
  • Common Crawl supports legitimate research, not just commercial AI
  • The "open web" argument (debatable)

I lean toward blocking if you're serious about protecting content. The horse may have left the barn, but you can close the door on future horses.

Checking for CCBot in your logs

grep -i "CCBot" /var/log/apache2/access.log | wc -l

CCBot crawls less frequently than GPTBot or ClaudeBot. You might see it weekly or monthly rather than daily. But when it hits, it crawls thoroughly.
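
To see when it last came through, rather than just a total count, pull the most recent entries (same assumption about the default Apache log path):

grep -i "CCBot" /var/log/apache2/access.log | tail -n 5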

Verification

curl -A "CCBot/2.0" -I https://yoursite.com/

Should return 403 Forbidden if your server block is active.

For robots.txt verification, check that the rule is properly formatted and accessible at your domain root.
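
One quick way to do that, assuming yoursite.com is your domain, is to fetch the live file and confirm the CCBot rule is present:

curl -s https://yoursite.com/robots.txt | grep -i -A 1 "CCBot"

You should see the User-agent: CCBot line followed by its Disallow: / directive.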

A note on ai.txt

Some sites are experimenting with an ai.txt file as a way to communicate AI training preferences. This is not a standard and not widely supported. Stick with robots.txt and server-level blocking for actual enforcement. ai.txt is more of a policy statement than a technical control.

My take

CCBot is worth blocking if you're blocking AI crawlers generally. Because Common Crawl data is used by so many downstream consumers, it's actually one of the more impactful crawlers to block.

That said, be realistic about the limitations. Past crawls are past crawls. You're protecting future content, not retroactively removing existing data.

Skip the manual work

Generate your blocking rules in seconds with our free tools.
