
Why Websites Are Blocking AI Crawlers

December 6, 2024 (Updated: Dec 6, 2024) · 6 min read

When I tell people I help websites block AI crawlers, the first question is usually "why would anyone do that?" Fair question. Here's what I've learned from talking to site owners who've made that choice.

The numbers tell a story

According to various studies in late 2024:

  • About 35% of top websites now block GPTBot
  • News sites block AI crawlers at even higher rates (some estimates say 50%+)
  • The trend is accelerating, not slowing down

These aren't paranoid webmasters. They're making a calculated decision. Let me break down the reasoning.

Reason 1: The training data problem

This is the big one. When GPTBot or ClaudeBot crawls your site, that content can end up in the next version of their AI models. Your articles, your documentation, your blog posts—now training data.

The issue isn't that AI reads your content. Humans read it too; that's the point. The issue is:

  1. No compensation — AI companies charge billions for model access. Your content helped train those models. You see none of that revenue.

  2. No attribution — When ChatGPT answers a question using knowledge from your article, there's no citation back to you. No traffic, no credit.

  3. No control — Once your content is in a training dataset, you can't remove it. There's no "right to be forgotten" for AI training data.

For content creators who invest in original work, this feels like value extraction without consent.

Reason 2: Server costs

AI crawlers are hungry. They crawl aggressively and frequently. I've seen logs where Bytespider hit a modest site 10,000+ times in a single day.

If you're on shared hosting or paying for bandwidth, this adds up. Even if you're on a flat-rate plan, the server load affects performance for real visitors.

Bytespider is particularly notorious for this. It's been widely reported to ignore robots.txt and Crawl-delay directives. Your polite request to slow down goes unheard.
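When robots.txt is ignored, the fallback is to refuse the crawler at the web server itself. As a minimal sketch (assuming nginx; the User-Agent pattern and the response code are choices you'd adjust for your own setup), a rule like this turns away anything identifying itself as Bytespider:

```nginx
# Inside a server { } block: reject requests whose User-Agent
# contains "bytespider" (case-insensitive). Extend the pattern
# to cover any other crawlers you want to turn away.
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```

The obvious caveat: this only catches crawlers that announce themselves honestly in the User-Agent header, a point I'll come back to below.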

Reason 3: Competitive concerns

Imagine you're a company that's invested heavily in creating expert content—technical documentation, tutorials, industry analysis. Now imagine a competitor uses ChatGPT to generate similar content, trained partly on yours.

The playing field doesn't feel level. Your R&D went into creating that knowledge; theirs went into pulling it back out through an AI intermediary.

This concern is especially acute for:

  • Technical documentation sites
  • Educational content creators
  • Niche expertise publishers
  • API documentation (ironic given how many AI companies have APIs)

Reason 4: Legal uncertainty

The legal landscape for AI training is genuinely unsettled. Multiple lawsuits are pending. The copyright implications of training on web content haven't been definitively resolved.

Some organizations block AI crawlers as a risk management measure. If your legal team is nervous about how your content might be used, blocking crawlers is the simplest mitigation.

Even if you're not worried about liability, having a clear robots.txt that prohibits AI training establishes your intent. That documentation matters if legal precedents shift.

Reason 5: Quality control concerns

Some publishers worry about AI hallucinations citing their content incorrectly. Your well-researched article gets summarized by an AI, but the summary introduces errors. Now there's wrong information floating around with your content as the implicit source.

This is especially concerning for:

  • Medical and health information
  • Financial and legal advice
  • Scientific and technical content
  • News reporting

The other side

It's worth acknowledging the counterarguments. Some people think blocking AI crawlers is:

  • Futile — The AI is already trained. Blocking now doesn't undo past crawling.
  • Counterproductive — Maybe you want AI to be able to answer questions about your product/content.
  • Against the spirit of the web — The internet was built on openness and sharing.

These are legitimate points. Blocking AI crawlers is a choice, not a mandate. For some sites, the trade-offs favor blocking. For others, they don't.

Who's blocking?

Based on public reporting and my own observations:

Blocking heavily:

  • Major news publishers (New York Times, Washington Post, many others)
  • Stock photo and image sites
  • Publishers with paywalled content
  • Some academic publishers
  • Various tech companies (including, ironically, some AI companies)

Not blocking (generally):

  • Open-source projects
  • Public documentation
  • Sites that want AI referral traffic
  • Content marketers (some see AI citations as exposure)

The practical reality

Here's what I tell people: blocking AI crawlers is neither paranoid nor pointless. It's a reasonable choice for sites that:

  1. Create significant original content
  2. Rely on that content for revenue or competitive advantage
  3. Care about control over how their content is used

For a personal blog? Maybe doesn't matter much. For a business built on content? Worth considering seriously.

What blocking won't do

Let's be realistic about limitations:

  • Won't undo past training — If your content was crawled before you blocked, it's potentially in training datasets already.
  • Won't stop all AI use — Users can copy-paste your content into ChatGPT. Blocking crawlers prevents automated training data collection, not all AI interaction.
  • Won't guarantee protection — Determined scrapers can use deceptive User-Agents or proxies. Blocking is a speed bump, not a wall.

My take

I think the current situation is genuinely unfair to content creators. AI companies have extracted enormous value from web content without consent or compensation. The fact that it's technically legal (at least for now) doesn't make it ethically clean.

That said, I also think the genie is out of the bottle. AI isn't going away, and the training has already happened. Blocking now is about:

  1. Taking a stance (your content, your rules)
  2. Preventing future training on new content
  3. Documenting your intent for potential legal protection

Is it a perfect solution? No. But it's something you can control.

How to block

If you've decided to block AI crawlers, our technical guides walk through the setup in detail.
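As a quick illustration, the usual starting point is robots.txt. The sketch below uses the User-Agent tokens these companies have published (GPTBot, ClaudeBot, Google-Extended, Bytespider); check each vendor's documentation for the current names before relying on it:

```text
# robots.txt — opt out of the major AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```

As noted earlier, Bytespider has been widely reported to ignore this file anyway, which is why a server-level rule like the nginx example above is worth pairing with it.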

The choice is yours. Now you have the information to make it.

Ready to block AI crawlers?

Use our free generators to create your blocking rules in seconds.