robots.txt vs .htaccess: Which Should You Use to Block AI Crawlers?
Two tools, two different jobs. Understanding when to use robots.txt vs .htaccess saves you headaches and gives you proper protection. Here's the breakdown.
The fundamental difference
| | robots.txt | .htaccess |
|---|---|---|
| Type | Polite request | Hard block |
| Enforcement | Bot's choice | Server enforces |
| Scope | Crawling only | All requests |
| Effect | Content not indexed | Request denied |
Think of it this way:
- robots.txt is a "Please Don't" sign on your lawn
- .htaccess is a locked gate
How robots.txt works
When a well-behaved crawler visits your site, it first checks robots.txt:
- Bot requests https://yoursite.com/robots.txt
- Bot reads the rules
- Bot decides whether to crawl based on rules
- If allowed, bot proceeds to crawl
The key word is "well-behaved." There's nothing forcing a bot to check robots.txt or follow its rules. It's a convention, not an enforcement mechanism.
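To make "bot's choice" concrete, here's a minimal sketch of the check a compliant crawler performs before fetching a page, using Python's standard urllib.robotparser (the URL is a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder domain)
rp = RobotFileParser("https://yoursite.com/robots.txt")
rp.read()

# A compliant crawler asks this question and walks away on False.
# Nothing in the protocol forces it to ask in the first place.
print(rp.can_fetch("GPTBot", "https://yoursite.com/some-article"))

With the example rules below, can_fetch returns False for GPTBot and the bot is expected to move on.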
Example robots.txt
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Googlebot
Allow: /
Use our robots.txt generator to create these rules automatically.
What robots.txt is good for
- Blocking legitimate crawlers that respect the standard
- Documenting your crawling policy
- Establishing intent for legal purposes
- Guiding search engines on what to index
What robots.txt can't do
- Block crawlers that ignore it (Bytespider often does)
- Hide content from determined scrapers
- Prevent any bot from accessing your server
- Retroactively remove content from training data
How .htaccess works
.htaccess rules are processed by Apache before serving any content:
- Request arrives at server
- Apache checks .htaccess rules
- If matched, request is denied with 403 Forbidden
- No content is ever served
The bot never sees your content. The server refuses before any page loads.
Example .htaccess
<IfModule mod_rewrite.c>
RewriteEngine On
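# Each condition matches a bot name in the User-Agent header; [NC] = case-insensitive, [OR] = any one match is enough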
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
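# On a match, deny the request with 403 Forbidden ([F]) and stop processing rules ([L])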
RewriteRule .* - [F,L]
</IfModule>
Use our .htaccess generator to create these rules automatically.
What .htaccess is good for
- Actually blocking requests at the server level
- Stopping bots that ignore robots.txt
- Protecting content from any request with matching User-Agent
- Rate limiting (with additional modules)
What .htaccess can't do
- Block bots using fake User-Agents
- Block IP ranges easily (possible, but unwieldy; see the sketch after this list)
- Work on non-Apache servers (use nginx config instead)
- Establish legal intent the way robots.txt does
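For completeness, blocking an IP range in .htaccess looks roughly like this on Apache 2.4, assuming your host allows AuthConfig overrides; the range shown is a documentation placeholder, and keeping such lists current as crawler IPs shift is the unwieldy part:

<IfModule mod_authz_core.c>
  <RequireAll>
    # Allow everyone except the listed range (203.0.113.0/24 is a placeholder)
    Require all granted
    Require not ip 203.0.113.0/24
  </RequireAll>
</IfModule>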
The User-Agent problem
Both methods rely on the crawler honestly identifying itself. A bot that sends:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
...looks like a regular browser. Neither robots.txt nor .htaccess will catch it.
This is why sophisticated scraping is hard to block entirely. The legitimate AI crawlers (GPTBot, ClaudeBot, etc.) do identify themselves honestly. That's who we're targeting with these methods.
When bots ignore robots.txt
Bytespider is the poster child for ignoring robots.txt. I've seen consistent reports of it continuing to crawl sites after they've disallowed it.
For these crawlers, you need:
- robots.txt (for documentation and the principle)
- .htaccess or server-level blocking (for enforcement)
The combination catches both well-behaved and misbehaving bots.
Nginx users
If you're on Nginx, there's no .htaccess. Use server config instead:
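# ~* below performs a case-insensitive regex match against the User-Agent header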
if ($http_user_agent ~* "GPTBot|ClaudeBot|Bytespider") {
return 403;
}
Same concept, different syntax. Nginx checks before serving, just like .htaccess on Apache.
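If your bot list grows, a map block keeps it in one place. This is a sketch of the same idea with placeholder names, not a drop-in config:

# In the http {} context: derive a flag from the User-Agent header
map $http_user_agent $block_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*Bytespider 1;
}

# In the relevant server {} block: refuse flagged requests before serving anything
server {
    listen 80;
    server_name yoursite.com;  # placeholder

    if ($block_ai_bot) {
        return 403;
    }
}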
The belt-and-suspenders approach
My recommendation: use both.
- robots.txt — Documents your policy, catches compliant crawlers
- Server-level rules — Enforces the policy, catches the rest
Here's why this matters:
- If you only use robots.txt, Bytespider ignores you
- If you only use .htaccess, you have no documented policy (matters legally)
- Using both covers all bases
Order of operations
When a request comes in:
- Server receives request
- .htaccess rules are evaluated (if Apache)
- If blocked, 403 Forbidden is returned
- If allowed, content is served
The bot's robots.txt check happens before it even makes requests (if it checks at all). So:
- robots.txt prevents compliant crawlers from even trying
- .htaccess catches any request that slips through
Edge cases
Caching layers
If you use Cloudflare, Varnish, or another caching layer, the cache might serve content to bots before your origin server sees the request.
Solution: Set up bot blocking at the edge (Cloudflare WAF rules, for example).
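As an illustration, a Cloudflare custom rule with its action set to Block can use an expression along these lines (the rule builder varies by plan, so treat this as a sketch):

(http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Bytespider")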
CDN-served assets
If your images/assets are served from a CDN, .htaccess on your origin won't protect them.
Solution: Configure bot blocking at the CDN level too.
WordPress caching plugins
Some caching plugins serve cached pages without consulting .htaccess.
Solution: Use a plugin that respects .htaccess, or block at the server/network level.
Comparison summary
| Criterion | robots.txt | .htaccess |
|---|---|---|
| Stops GPTBot | Yes | Yes |
| Stops Bytespider | Often ignored | Yes |
| Requires bot cooperation | Yes | No |
| Works on shared hosting | Yes | Usually |
| Establishes legal intent | Yes | Less clear |
| Performance impact | None | Minimal |
My recommendation
Here's the setup I use and recommend:
- Create robots.txt blocking all AI crawlers
- Create .htaccess rules blocking the same crawlers
- If using Cloudflare or similar, add WAF rules too
- Monitor logs to verify blocking is working (a verification sketch follows below)
Three layers might seem like overkill. It's not. The AI crawler landscape changes frequently, and defense in depth catches what individual layers miss.
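For the monitoring step, a small script like this can summarize what known AI crawlers are getting from your server. It's a sketch that assumes the common combined log format and a log path you'd adjust for your setup:

import re
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "Bytespider")
LOG_PATH = "/var/log/apache2/access.log"  # adjust (nginx is typically /var/log/nginx/access.log)

# Combined log format: ... "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LINE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        status, user_agent = match.groups()
        for bot in AI_BOTS:
            if bot in user_agent:
                counts[(bot, status)] += 1

# Working blocks show up as 403s; 200s mean requests are still getting through.
for (bot, status), hits in sorted(counts.items()):
    print(f"{bot}: {hits} requests -> HTTP {status}")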
Generate your blocking rules in seconds with our free tools.