robots.txt vs .htaccess: Which Should You Use to Block AI Crawlers?
Two tools, two different jobs. Understanding when to use robots.txt vs .htaccess saves you headaches and gives you proper protection. Here's the breakdown.
The fundamental difference
| | robots.txt | .htaccess |
|---|---|---|
| Type | Polite request | Hard block |
| Enforcement | Bot's choice | Server enforces |
| Scope | Crawling only | All requests |
| Effect | Content not indexed | Request denied |
Think of it this way:
- robots.txt is a "Please Don't" sign on your lawn
- .htaccess is a locked gate
How robots.txt works
When a well-behaved crawler visits your site, it first checks robots.txt:
- Bot requests https://yoursite.com/robots.txt
- Bot reads the rules
- Bot decides whether to crawl based on rules
- If allowed, bot proceeds to crawl
The key word is "well-behaved." There's nothing forcing a bot to check robots.txt or follow its rules. It's a convention, not an enforcement mechanism.
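To make "bot's choice" concrete, here's a minimal sketch of the check a compliant crawler performs before fetching a page, using Python's standard urllib.robotparser (the URL is a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain)
rp = RobotFileParser("https://yoursite.com/robots.txt")
rp.read()

# A compliant crawler asks this question and walks away on False.
# Nothing in the protocol forces it to ask in the first place.
print(rp.can_fetch("GPTBot", "https://yoursite.com/some-article"))

With the example rules below, can_fetch returns False for GPTBot and the bot is expected to move on.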
Example robots.txt
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Googlebot
Allow: /
Use our robots.txt generator to create these rules automatically.
What robots.txt is good for
- Blocking legitimate crawlers that respect the standard
- Documenting your crawling policy
- Establishing intent for legal purposes
- Guiding search engines on what to index
What robots.txt can't do
- Block crawlers that ignore it (Bytespider often does)
- Hide content from determined scrapers
- Prevent any bot from accessing your server
- Retroactively remove content from training data
How .htaccess works
.htaccess rules are processed by Apache before serving any content:
- Request arrives at server
- Apache checks .htaccess rules
- If matched, request is denied with 403 Forbidden
- No content is ever served
The bot never sees your content. The server refuses before any page loads.
Example .htaccess
<IfModule mod_rewrite.c>
RewriteEngine On
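# Each condition matches a bot name in the User-Agent header; [NC] = case-insensitive, [OR] = any one match is enough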
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
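# On a match, deny the request with 403 Forbidden ([F]) and stop processing rules ([L])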
RewriteRule .* - [F,L]
</IfModule>
Use our .htaccess generator to create these rules automatically.
What .htaccess is good for
- Actually blocking requests at the server level
- Stopping bots that ignore robots.txt
- Protecting content from any request with matching User-Agent
- Rate limiting (with additional modules)
What .htaccess can't do
- Block bots using fake User-Agents
- Block IP ranges easily (possible, but unwieldy; see the sketch after this list)
- Work on non-Apache servers (use nginx config instead)
- Establish legal intent the way robots.txt does
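For completeness, blocking an IP range in .htaccess looks roughly like this on Apache 2.4, assuming your host allows AuthConfig overrides; the range shown is a documentation placeholder, and keeping such lists current as crawler IPs shift is the unwieldy part:

<IfModule mod_authz_core.c>
  <RequireAll>
    # Allow everyone except the listed range (203.0.113.0/24 is a placeholder)
    Require all granted
    Require not ip 203.0.113.0/24
  </RequireAll>
</IfModule>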
The User-Agent problem
Both methods rely on the crawler honestly identifying itself. A bot that sends:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
...looks like a regular browser. Neither robots.txt nor .htaccess will catch it.
This is why sophisticated scraping is hard to block entirely. The legitimate AI crawlers (GPTBot, ClaudeBot, etc.) do identify themselves honestly. That's who we're targeting with these methods.
When bots ignore robots.txt
Bytespider is the poster child for ignoring robots.txt. I've seen consistent reports of it continuing to crawl sites after they've disallowed it.
For these crawlers, you need:
- robots.txt (for documentation and the principle)
- .htaccess or server-level blocking (for enforcement)
The combination catches both well-behaved and misbehaving bots.
Nginx users
If you're on Nginx, there's no .htaccess. Use server config instead:
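# ~* below performs a case-insensitive regex match against the User-Agent header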
if ($http_user_agent ~* "GPTBot|ClaudeBot|Bytespider") {
return 403;
}
Same concept, different syntax. Nginx checks before serving, just like .htaccess on Apache.
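If your bot list grows, a map block keeps it in one place. This is a sketch of the same idea with placeholder names, not a drop-in config:

# In the http {} context: derive a flag from the User-Agent header
map $http_user_agent $block_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*Bytespider 1;
}

# In the relevant server {} block: refuse flagged requests before serving anything
server {
    listen 80;
    server_name yoursite.com;  # placeholder

    if ($block_ai_bot) {
        return 403;
    }
}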
The belt-and-suspenders approach
My recommendation: use both.
- robots.txt — Documents your policy, catches compliant crawlers
- Server-level rules — Enforces the policy, catches the rest
Here's why this matters:
- If you only use robots.txt, Bytespider ignores you
- If you only use .htaccess, you have no documented policy (matters legally)
- Using both covers all bases
Order of operations
When a request comes in:
- Server receives request
- .htaccess rules are evaluated (if Apache)
- If blocked, 403 Forbidden is returned
- If allowed, content is served
The bot's robots.txt check happens before it even makes requests (if it checks at all). So:
- robots.txt prevents compliant crawlers from even trying
- .htaccess catches any request that slips through
Edge cases
Caching layers
If you use Cloudflare, Varnish, or another caching layer, the cache might serve content to bots before your origin server sees the request.
Solution: Set up bot blocking at the edge (Cloudflare WAF rules, for example).
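As an illustration, a Cloudflare custom rule with its action set to Block can use an expression along these lines (the rule builder varies by plan, so treat this as a sketch):

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Bytespider")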
CDN-served assets
If your images/assets are served from a CDN, .htaccess on your origin won't protect them.
Solution: Configure bot blocking at the CDN level too.
WordPress caching plugins
Some caching plugins serve cached pages without consulting .htaccess.
Solution: Use a plugin that respects .htaccess, or block at the server/network level.
Comparison summary
| Criterion | robots.txt | .htaccess |
|---|---|---|
| Stops GPTBot | Yes | Yes |
| Stops Bytespider | Often ignored | Yes |
| Requires bot cooperation | Yes | No |
| Works on shared hosting | Yes | Usually |
| Establishes legal intent | Yes | Less clear |
| Performance impact | None | Minimal |
My recommendation
Here's the setup I use and recommend:
- Create robots.txt blocking all AI crawlers
- Create .htaccess rules blocking the same crawlers
- If using Cloudflare or similar, add WAF rules too
- Monitor logs to verify blocking is working (a verification sketch follows below)
Three layers might seem like overkill. It's not. The AI crawler landscape changes frequently, and defense in depth catches what individual layers miss.
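For the monitoring step, a small script like this can summarize what known AI crawlers are getting from your server. It's a sketch that assumes the common combined log format and a log path you'd adjust for your setup:

import re
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider")
LOG_PATH = "/var/log/apache2/access.log"  # adjust (nginx is typically /var/log/nginx/access.log)

# Combined log format: ... "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LINE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        status, user_agent = match.groups()
        for bot in AI_BOTS:
            if bot in user_agent:
                counts[(bot, status)] += 1

# Working blocks show up as 403s; 200s mean requests are still getting through.
for (bot, status), hits in sorted(counts.items()):
    print(f"{bot}: {hits} requests -> HTTP {status}")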
Generate your blocking rules in seconds with our free tools.