How to Block AI Crawlers: The Complete Guide (2025)
So you've noticed AI bots hammering your server logs and you want them gone. Fair enough—I spent a weekend dealing with this on my own sites after seeing Bytespider hit my pages 10,000 times in a single day. Here's what actually works.
Why websites are blocking AI crawlers
Before we get into the how, let's talk about the why. There are basically three camps:
- Bandwidth concerns — AI crawlers are aggressive. They don't always respect crawl-delay directives, and some will hammer your server relentlessly. If you're on shared hosting or paying for bandwidth, this adds up fast.
- Content protection — Your content gets slurped up into training datasets without compensation or attribution. Whether you're a publisher, blogger, or running a documentation site, this rankles.
- Legal/compliance — Some organizations have policies about AI training data. Others are in regulated industries where any data leakage is a problem.
There's no wrong reason here. It's your server, your content, your rules.
The two main methods
You've got two primary tools for blocking AI crawlers:
robots.txt (polite request)
The robots.txt file is the traditional way to tell crawlers what they can and can't access. You put it at the root of your site (e.g., https://yoursite.com/robots.txt), and well-behaved bots check it before crawling.
Here's the thing though: robots.txt is a suggestion, not a command. Bots can ignore it entirely. Most legitimate AI crawlers—GPTBot, ClaudeBot, Google-Extended—do respect it. But some don't. We'll get to those.
.htaccess / server rules (hard block)
For Apache servers, .htaccess rules can block bots at the server level. They never even get your content—the server returns a 403 Forbidden. Nginx, Cloudflare, and other platforms have their own equivalents.
This is the "no really, go away" option.
Blocking with robots.txt
Let's start with the basics. Here's a robots.txt that blocks the major AI crawlers:
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Omgilibot
Disallow: /
# Allow regular search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: *
Allow: /
Notice I'm explicitly allowing Googlebot and Bingbot. This is important—you don't want to accidentally tank your SEO while blocking AI crawlers. Google-Extended is separate from Googlebot; blocking one doesn't affect the other.
Use our robots.txt generator to create these rules automatically.
Try robots.txt Generator →
The crawl-delay gotcha
Some people try adding Crawl-delay: 10 to slow down bots instead of blocking them entirely. Here's the problem: most AI crawlers don't respect it. The Crawl-delay directive isn't part of the official robots.txt spec, and support is spotty at best.
If you want to rate-limit bots rather than block them, you'll need server-level controls.
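If you're on nginx, for example, the stock limit_req module can handle this. Here's a minimal sketch; the zone name and the 2 requests/second rate are made-up examples, so tune them to your traffic:
# nginx sketch: rate-limit requests per client IP
# Everything below lives inside the http {} block of nginx.conf.
limit_req_zone $binary_remote_addr zone=crawl_limit:10m rate=2r/s;

server {
    listen 80;
    server_name yoursite.com;

    location / {
        # Allow short bursts of 10 extra requests, reject the overflow with a 503
        limit_req zone=crawl_limit burst=10 nodelay;
        # ... the rest of your normal config
    }
}
This throttles everyone equally by IP; pair it with User-Agent blocks if you only want to squeeze the bots.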
Blocking with .htaccess
For Apache servers, .htaccess gives you actual enforcement power. Here's a solid blocking ruleset:
# Block AI crawlers by User-Agent
<IfModule mod_rewrite.c>
RewriteEngine On
# OpenAI
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT [NC,OR]
# Anthropic
RewriteCond %{HTTP_USER_AGENT} Claude [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Anthropic [NC,OR]
# Google AI (not search!)
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
# Common Crawl
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
# ByteDance
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
# Perplexity
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC]
RewriteRule .* - [F,L]
</IfModule>
This returns a 403 Forbidden to any request with a matching User-Agent. The [NC] flag makes the match case-insensitive, and [OR] means a hit on any single condition is enough; the final condition drops [OR] to close the chain before the RewriteRule fires.
Test your .htaccess changes on a staging site first. A syntax error can take down your entire site.
Use our .htaccess generator to create these rules automatically.
Try .htaccess Generator →
The problem bots
Now here's where it gets frustrating. Some bots just don't play nice.
Bytespider (ByteDance/TikTok)
Bytespider is notorious for ignoring robots.txt. ByteDance claims they respect it, but server logs tell a different story. You pretty much need .htaccess or server-level blocking for this one.
Bytespider has been observed ignoring robots.txt directives. Use server-level blocking (.htaccess, nginx rules, or WAF) for reliable enforcement.
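If you're on nginx rather than Apache, a sketch of the equivalent server-level block looks like this. The bot list in the pattern is just an example; trim or extend it to match your robots.txt:
# nginx sketch: return 403 to Bytespider and other AI User-Agents
# Place inside the relevant server {} block.
if ($http_user_agent ~* "(bytespider|gptbot|claudebot|ccbot|perplexitybot)") {
    return 403;
}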
Random scrapers
Beyond the big names, there are countless scrapers that don't identify themselves properly or use generic User-Agents. Blocking these requires more aggressive measures:
- IP-based blocking (but they use rotating proxies)
- Rate limiting
- CAPTCHAs for suspicious traffic
- Cloudflare Bot Fight Mode
Honestly, if you're dealing with serious scraping, a WAF (Web Application Firewall) like Cloudflare's is worth considering. The free tier blocks a lot of garbage.
Platform-specific guides
The exact implementation varies depending on your setup:
- WordPress — Use a plugin like Blackhole for Bad Bots, or add rules to your .htaccess. See our WordPress guide.
- Nginx — Use if statements in your server config with return 403. See our Nginx guide.
- Apache — The .htaccess method above works. See our Apache guide.
- Cloudflare — Create custom firewall rules based on User-Agent. See our Cloudflare guide.
- Vercel/Netlify — Use middleware or edge functions. See our Vercel guide or Netlify guide.
Should you use both?
Yes. robots.txt handles the polite bots, .htaccess (or equivalent) handles the rest. Belt and suspenders.
The robots.txt also serves as documentation of your intent. If you ever need to pursue legal action against scrapers, having a clear robots.txt that prohibits AI crawling establishes that you didn't consent.
What about partial blocking?
Maybe you want AI to crawl some pages but not others. That's valid. You can use path-based rules:
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Allow: /blog/
Allow: /docs/
This blocks GPTBot from /premium/ and /members/ but allows it to crawl /blog/ and /docs/. Useful if you want some content indexed by AI while protecting paid content.
Verifying it works
After setting up your blocks, you'll want to verify they're working. Unfortunately, you can't just check by visiting your site—your browser isn't pretending to be GPTBot.
A few options:
- curl with a custom User-Agent:
curl -A "GPTBot/1.0" https://yoursite.com/
Should return 403 Forbidden if your .htaccess is working.
- Check your server logs — Look for requests from blocked User-Agents. If they're still getting 200 OK responses, something's wrong.
- Wait and watch — AI crawlers typically return every few days. If your logs show 403s for them, you're good.
The "nuclear option"
If you really want to be thorough, you can combine multiple approaches:
- robots.txt blocking (catches well-behaved bots)
- .htaccess User-Agent blocking (catches most others)
- Cloudflare Bot Fight Mode (catches the rest)
- IP blocking for known offender ranges, if you're really determined (see the sketch below)
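For the IP-blocking piece, nginx's deny directive does the job. The CIDR ranges below are placeholder documentation addresses, not real crawler ranges; substitute whatever you actually see in your logs or in published bot IP lists:
# nginx sketch: deny specific IP ranges (placeholder CIDRs shown; use ranges from your own logs)
location / {
    deny 203.0.113.0/24;
    deny 198.51.100.0/24;
    allow all;
}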
Is this overkill for most sites? Probably. But if you're running a content-heavy site and you're serious about protecting your work, it's an option.
What blocking won't do
A few things to keep in mind:
- Won't remove existing training data — If your content was already scraped before you blocked the bots, it's already in their datasets. Blocking only prevents future crawling.
- Won't stop all AI use of your content — Users can still copy-paste your content into ChatGPT. You're blocking automated crawling, not all AI access.
- Won't guarantee protection — Determined scrapers can bypass User-Agent checks. This isn't a security measure, it's a speed bump.
Quick summary
- Add AI crawlers to your robots.txt (catches most legitimate bots)
- Add .htaccess rules or server-level blocks (catches the rest)
- Consider Cloudflare or similar WAF for additional protection
- Monitor your logs to verify it's working
Generate your blocking rules in seconds with our free tools.
If you want to skip the manual work, our robots.txt generator and .htaccess generator can create these rules for you in about 30 seconds. Just check the boxes for the bots you want to block.
The AI crawler landscape changes frequently. New bots appear, existing ones change their User-Agent strings. Check back on our complete crawler list for updates.