Tags: ai-crawlers, robots-txt, htaccess, security

How to Block AI Crawlers: The Complete Guide (2025)

November 18, 2024 (Updated: Dec 5, 2024) · 7 min read

So you've noticed AI bots hammering your server logs and you want them gone. Fair enough—I spent a weekend dealing with this on my own sites after seeing Bytespider hit my pages 10,000 times in a single day. Here's what actually works.

Why websites are blocking AI crawlers

Before we get into the how, let's talk about the why. There are basically three camps:

  1. Bandwidth concerns — AI crawlers are aggressive. They don't always respect crawl-delay directives, and some will hammer your server relentlessly. If you're on shared hosting or paying for bandwidth, this adds up fast.

  2. Content protection — Your content gets slurped up into training datasets without compensation or attribution. Whether you're a publisher, blogger, or running a documentation site, this rankles.

  3. Legal/compliance — Some organizations have policies about AI training data. Others are in regulated industries where any data leakage is a problem.

There's no wrong reason here. It's your server, your content, your rules.

The two main methods

You've got two primary tools for blocking AI crawlers:

robots.txt (polite request)

The robots.txt file is the traditional way to tell crawlers what they can and can't access. You put it at the root of your site (e.g., https://yoursite.com/robots.txt), and well-behaved bots check it before crawling.

Here's the thing though: robots.txt is a suggestion, not a command. Bots can ignore it entirely. Most legitimate AI crawlers—GPTBot, ClaudeBot, Google-Extended—do respect it. But some don't. We'll get to those.

.htaccess / server rules (hard block)

For Apache servers, .htaccess rules can block bots at the server level. They never even get your content—the server returns a 403 Forbidden. Nginx, Cloudflare, and other platforms have their own equivalents.

This is the "no really, go away" option.

Blocking with robots.txt

Let's start with the basics. Here's a robots.txt that blocks the major AI crawlers:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Omgilibot
Disallow: /

# Allow regular search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Notice I'm explicitly allowing Googlebot and Bingbot. This is important—you don't want to accidentally tank your SEO while blocking AI crawlers. Google-Extended is separate from Googlebot; blocking one doesn't affect the other.

Want to skip the copy-paste?

Use our robots.txt generator to create these rules automatically.

Try robots.txt Generator

The crawl-delay gotcha

Some people try adding Crawl-delay: 10 to slow down bots instead of blocking them entirely. Here's the problem: most AI crawlers don't respect it. The Crawl-delay directive isn't part of the official robots.txt spec, and support is spotty at best.

If you want to rate-limit bots rather than block them, you'll need server-level controls.
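If you're on nginx (or have it in front of your stack), its built-in request limiting can do this. Here's a minimal sketch; the zone name, rate, and burst values are placeholders you'd tune for your own traffic:

# Sketch: rate-limit clients instead of blocking outright (nginx)
# "crawlers", the 1 request/second rate, and the burst size are placeholders
# limit_req_zone belongs in the http {} context; limit_req goes in a location
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;

server {
    location / {
        limit_req zone=crawlers burst=5 nodelay;
        # ...the rest of your normal configuration...
    }
}

Requests over the limit get a 503 by default, which is usually enough to make an aggressive crawler back off.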

Blocking with .htaccess

For Apache servers, .htaccess gives you actual enforcement power. Here's a solid blocking ruleset:

# Block AI crawlers by User-Agent
<IfModule mod_rewrite.c>
RewriteEngine On

# OpenAI
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT [NC,OR]

# Anthropic
RewriteCond %{HTTP_USER_AGENT} Claude [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Anthropic [NC,OR]

# Google AI (not search!)
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]

# Common Crawl
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]

# ByteDance
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]

# Perplexity
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC]

RewriteRule .* - [F,L]
</IfModule>

This returns a 403 Forbidden to any request with a matching User-Agent. The [NC] flag makes the match case-insensitive, and [OR] chains the conditions so that matching any one of them triggers the rule; note that the last condition omits [OR] to close the chain.

Test your .htaccess changes on a staging site first. A syntax error can take down your entire site.
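Not running Apache? A rough nginx equivalent uses a map on the User-Agent header. This is a sketch, not a drop-in config; adjust the bot list so it mirrors your robots.txt:

# Goes in the http {} context
map $http_user_agent $block_ai_bot {
    default             0;
    ~*gptbot            1;
    ~*chatgpt           1;
    ~*claude            1;
    ~*anthropic         1;
    ~*google-extended   1;
    ~*ccbot             1;
    ~*bytespider        1;
    ~*perplexitybot     1;
}

server {
    # Return 403 to anything flagged by the map above
    if ($block_ai_bot) {
        return 403;
    }
    # ...the rest of your server block...
}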

Want to skip the copy-paste?

Use our .htaccess generator to create these rules automatically.

Try .htaccess Generator

The problem bots

Now here's where it gets frustrating. Some bots just don't play nice.

Bytespider (ByteDance/TikTok)

Bytespider is notorious for ignoring robots.txt. ByteDance claims they respect it, but server logs tell a different story. You pretty much need .htaccess or server-level blocking for this one.

Bytespider has been observed ignoring robots.txt directives. Use server-level blocking (.htaccess, nginx rules, or WAF) for reliable enforcement.
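If your site sits behind Cloudflare, a custom WAF rule is the easiest way to enforce this at the edge. Something like the following expression with the action set to Block should cover it (double-check the field name against Cloudflare's current rule builder):

(http.user_agent contains "Bytespider")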

Random scrapers

Beyond the big names, there are countless scrapers that don't identify themselves properly or use generic User-Agents. Blocking these requires more aggressive measures:

  • IP-based blocking (but they use rotating proxies; a sketch follows this list)
  • Rate limiting
  • CAPTCHAs for suspicious traffic
  • Cloudflare Bot Fight Mode
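If you do go the IP route for a known offender range, Apache 2.4 can handle it in .htaccess, assuming your host allows the override. A minimal sketch, with a documentation range standing in for whatever you've actually identified in your logs:

# Block a known offender range (203.0.113.0/24 is a placeholder)
<RequireAll>
Require all granted
Require not ip 203.0.113.0/24
</RequireAll>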

Honestly, if you're dealing with serious scraping, a WAF (Web Application Firewall) like Cloudflare's is worth considering. The free tier blocks a lot of garbage.

Platform-specific guides

The exact implementation varies depending on your setup: the robots.txt rules above work everywhere, but the server-level blocks need translating for nginx, Cloudflare, or whatever else you're running, as the snippets throughout this post show.

Should you use both?

Yes. robots.txt handles the polite bots, .htaccess (or equivalent) handles the rest. Belt and suspenders.

The robots.txt also serves as documentation of your intent. If you ever need to pursue legal action against scrapers, having a clear robots.txt that prohibits AI crawling establishes that you didn't consent.

What about partial blocking?

Maybe you want AI to crawl some pages but not others. That's valid. You can use path-based rules:

User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /docs/

This blocks GPTBot from /premium/ and /members/ but allows it to crawl /blog/ and /docs/. Useful if you want some content indexed by AI while protecting paid content.
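The same split works at the server level. Here's a hedged .htaccess sketch that only blocks AI bots under the protected paths, reusing the directories from the example above:

# Block AI crawlers only under /premium/ and /members/
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} ^/(premium|members)/ [NC]
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT|Claude|Anthropic|CCBot|Bytespider|PerplexityBot) [NC]
RewriteRule .* - [F,L]
</IfModule>

Both conditions here lack [OR], so they combine as AND: the request must be under a protected path and come from a listed bot.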

Verifying it works

After setting up your blocks, you'll want to verify they're working. Unfortunately, you can't just check by visiting your site—your browser isn't pretending to be GPTBot.

A few options:

  1. curl with custom User-Agent:
curl -A "GPTBot/1.0" https://yoursite.com/

Should return 403 Forbidden if your .htaccess is working.

  2. Check your server logs — Look for requests from blocked User-Agents. If they're still getting 200 OK responses, something's wrong. (A quick grep for this follows below.)

  3. Wait and watch — AI crawlers typically return every few days. If your logs show 403s for them, you're good.
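For the log check, a one-liner does the trick. This assumes an Apache combined-format log at the usual Debian/Ubuntu path; adjust the path (and the status-code field) for your server:

# Count responses by status code for the blocked bots ($9 is the status field in combined log format)
grep -iE "gptbot|claudebot|ccbot|bytespider|perplexitybot" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c

A healthy result is a pile of 403s and no 200s.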

The "nuclear option"

If you really want to be thorough, you can combine multiple approaches:

  1. robots.txt blocking (catches well-behaved bots)
  2. .htaccess User-Agent blocking (catches most others)
  3. Cloudflare Bot Fight Mode (catches the rest)
  4. IP blocking for known offender ranges (if you're really determined)

Is this overkill for most sites? Probably. But if you're running a content-heavy site and you're serious about protecting your work, it's an option.

What blocking won't do

A few things to keep in mind:

  • Won't remove existing training data — If your content was already scraped before you blocked the bots, it's already in their datasets. Blocking only prevents future crawling.

  • Won't stop all AI use of your content — Users can still copy-paste your content into ChatGPT. You're blocking automated crawling, not all AI access.

  • Won't guarantee protection — Determined scrapers can bypass User-Agent checks. This isn't a security measure, it's a speed bump.

Quick summary

  1. Add AI crawlers to your robots.txt (catches most legitimate bots)
  2. Add .htaccess rules or server-level blocks (catches the rest)
  3. Consider Cloudflare or similar WAF for additional protection
  4. Monitor your logs to verify it's working

If you want to skip the manual work, our robots.txt generator and .htaccess generator can create these rules for you in about 30 seconds. Just check the boxes for the bots you want to block.

The AI crawler landscape changes frequently. New bots appear, existing ones change their User-Agent strings. Check back on our complete crawler list for updates.

