
robots.txt vs .htaccess: Which Should You Use to Block AI Crawlers?

December 5, 2024 · 5 min read

Two tools, two different jobs. Understanding when to use robots.txt vs .htaccess saves you headaches and gives you proper protection. Here's the breakdown.

The fundamental difference

| | robots.txt | .htaccess |
|---|---|---|
| Type | Polite request | Hard block |
| Enforcement | Bot's choice | Server enforces |
| Scope | Crawling only | All requests |
| Effect | Content not indexed | Request denied |

Think of it this way:

  • robots.txt is a "Please Don't" sign on your lawn
  • .htaccess is a locked gate

How robots.txt works

When a well-behaved crawler visits your site, it first checks robots.txt:

  1. Bot requests https://yoursite.com/robots.txt
  2. Bot reads the rules
  3. Bot decides whether to crawl based on rules
  4. If allowed, bot proceeds to crawl

The key word is "well-behaved." There's nothing forcing a bot to check robots.txt or follow its rules. It's a convention, not an enforcement mechanism.

Example robots.txt

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /
Want to skip the copy-paste?

Use our robots.txt generator to create these rules automatically.

Try robots.txt Generator

What robots.txt is good for

  • Blocking legitimate crawlers that respect the standard
  • Documenting your crawling policy (see the commented example after this list)
  • Establishing intent for legal purposes
  • Guiding search engines on what to index
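
On the documentation point: robots.txt accepts # comments, so the file itself can spell out your policy in plain language. A minimal sketch (the comment wording and contact address are placeholders):

# Policy: AI training crawlers are not permitted on this site.
# Contact: webmaster@yoursite.com

User-agent: GPTBot
Disallow: /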

What robots.txt can't do

  • Block crawlers that ignore it (Bytespider often does)
  • Hide content from determined scrapers
  • Prevent any bot from accessing your server
  • Retroactively remove content from training data

How .htaccess works

.htaccess rules are processed by Apache before serving any content:

  1. Request arrives at server
  2. Apache checks .htaccess rules
  3. If matched, request is denied with 403 Forbidden
  4. No content is ever served

The bot never sees your content. The server refuses before any page loads.

Example .htaccess

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]
</IfModule>
Want to skip the copy-paste?

Use our .htaccess generator to create these rules automatically.

Try .htaccess Generator

What .htaccess is good for

  • Actually blocking requests at the server level
  • Stopping bots that ignore robots.txt
  • Protecting content from any request with matching User-Agent
  • Rate limiting (with additional modules)

What .htaccess can't do

  • Block bots using fake User-Agents
  • Block IP ranges easily (possible, but unwieldy; see the sketch after this list)
  • Work on non-Apache servers (use nginx config instead)
  • Establish legal intent the way robots.txt does
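
On the IP-range point: blocking by address is possible with Apache 2.4 syntax, it just gets tedious as the list grows. A minimal sketch (203.0.113.0/24 is a documentation placeholder, not a real crawler's range):

<RequireAll>
Require all granted
# Placeholder range; substitute the ranges you actually want to block
Require not ip 203.0.113.0/24
</RequireAll>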

The User-Agent problem

Both methods rely on the crawler honestly identifying itself. A bot that sends:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36

...looks like a regular browser. Neither robots.txt nor .htaccess will catch it.

This is why sophisticated scraping is hard to block entirely. The legitimate AI crawlers (GPTBot, ClaudeBot, etc.) do identify themselves honestly. That's who we're targeting with these methods.
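
A useful sanity check is to look at who is actually hitting your site. A rough one-liner, assuming Apache's combined log format at the default Debian/Ubuntu path:

awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

Self-identified AI crawlers show up by name; a flood of generic browser strings from a narrow IP range is the usual signature of a scraper pretending to be one.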

When bots ignore robots.txt

Bytespider is the poster child for ignoring robots.txt. I've seen consistent reports of it continuing to crawl sites after being disallowed.

For these crawlers, you need:

  1. robots.txt (for documentation and the principle)
  2. .htaccess or server-level blocking (for enforcement)

The combination catches both well-behaved and misbehaving bots.

Nginx users

If you're on Nginx, there's no .htaccess. Use server config instead:

if ($http_user_agent ~* "GPTBot|ClaudeBot|Bytespider") {
    return 403;
}

Same concept, different syntax. Nginx checks before serving, just like .htaccess on Apache.
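
If your bot list grows, a map block (defined in the http context) scales better than a long regex inside if. A sketch:

map $http_user_agent $block_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*Bytespider 1;
}

server {
    # ... your existing server config ...
    if ($block_ai_bot) {
        return 403;
    }
}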

The belt-and-suspenders approach

My recommendation: use both.

  1. robots.txt — Documents your policy, catches compliant crawlers
  2. Server-level rules — Enforces the policy, catches the rest

Here's why this matters:

  • If you only use robots.txt, Bytespider ignores you
  • If you only use .htaccess, you have no documented policy (matters legally)
  • Using both covers all bases

Order of operations

When a request comes in:

  1. Server receives request
  2. .htaccess rules are evaluated (if Apache)
  3. If blocked, 403 Forbidden is returned
  4. If allowed, content is served

The bot's robots.txt check happens before it even makes requests (if it checks at all). So:

  • robots.txt prevents compliant crawlers from even trying
  • .htaccess catches any request that slips through
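
You can test the server-level layer yourself by spoofing a blocked User-Agent (swap in your own domain):

curl -I -A "GPTBot" https://yoursite.com/

A 403 Forbidden response means the rule is firing; a 200 means the request got through.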

Edge cases

Caching layers

If you use Cloudflare, Varnish, or another caching layer, the cache might serve content to bots before your origin server sees the request.

Solution: Set up bot blocking at the edge (Cloudflare WAF rules, for example).
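
For example, a Cloudflare custom WAF rule with a Block action can use an expression along these lines (adjust the bot names to match your own list):

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Bytespider")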

CDN-served assets

If your images/assets are served from a CDN, .htaccess on your origin won't protect them.

Solution: Configure bot blocking at the CDN level too.

WordPress caching plugins

Some caching plugins serve cached pages without consulting .htaccess.

Solution: Use a plugin that respects .htaccess, or block at the server/network level.

Comparison summary

| Criterion | robots.txt | .htaccess |
|---|---|---|
| Stops GPTBot | Yes | Yes |
| Stops Bytespider | Often ignored | Yes |
| Requires bot cooperation | Yes | No |
| Works on shared hosting | Yes | Usually |
| Establishes legal intent | Yes | Less clear |
| Performance impact | None | Minimal |

My recommendation

Here's the setup I use and recommend:

  1. Create robots.txt blocking all AI crawlers
  2. Create .htaccess rules blocking the same crawlers
  3. If using Cloudflare or similar, add WAF rules too
  4. Monitor logs to verify blocking is working
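
For step 4, a quick way to check which status codes known crawlers are receiving (again assuming Apache's combined log format):

grep -Ei "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c

A column of 403s means the blocks are working; 200s mean requests are still getting through, often via a caching layer as noted above.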

Three layers might seem like overkill. It's not. The AI crawler landscape changes frequently, and defense in depth catches what individual layers miss.

Skip the manual work

Generate your blocking rules in seconds with our free tools.

