
How to Block GPTBot (OpenAI's Web Crawler)

November 22, 2024 (Updated: December 5, 2024) · 5 min read

GPTBot is OpenAI's primary web crawler, used to collect training data for their AI models. It's probably the most talked-about AI crawler right now, and for good reason—if you're blocking any AI crawlers, this is usually the first one on the list.

What is GPTBot?

GPTBot crawls websites to gather text that may be used to train future GPT models. OpenAI has been pretty upfront about this: if GPTBot can access your content, that content might end up in GPT-5 or whatever comes next.

The User-Agent string looks like:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
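If you log or route traffic in application code, you can spot these crawlers by checking the User-Agent for the tokens above. A minimal sketch in Python (the function name and token list are my own, not an OpenAI API):

```python
# Tokens that identify OpenAI's crawlers in the User-Agent header.
OPENAI_CRAWLER_TOKENS = ("gptbot", "chatgpt-user")

def is_openai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known OpenAI crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in OPENAI_CRAWLER_TOKENS)

# The full GPTBot User-Agent string matches; a normal browser does not.
gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)
print(is_openai_crawler(gptbot_ua))                         # True
print(is_openai_crawler("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

Substring matching is deliberately loose here; it mirrors how the robots.txt and server rules later in this post identify the bots.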

OpenAI also runs ChatGPT-User, which is different: it handles real-time browsing when a ChatGPT user asks the AI to visit a specific URL. We'll cover that too.

The quick solution

Don't want to read the whole post? Here's the minimum viable block:

robots.txt:

User-agent: GPTBot
Disallow: /

That's it. GPTBot respects robots.txt, so this will prevent it from crawling your site.

Want to skip the copy-paste?

Use our robots.txt generator to create these rules automatically.

Try robots.txt Generator

robots.txt method (recommended)

Add these lines to your robots.txt file at the root of your website:

# Block OpenAI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Continue allowing search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
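You can sanity-check these rules before deploying them with Python's built-in `urllib.robotparser`, which applies the same user-agent grouping that well-behaved crawlers do (the rules string below is a condensed version of the file above):

```python
from urllib.robotparser import RobotFileParser

# Condensed version of the robots.txt rules above.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot is denied everywhere; Googlebot keeps full access.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

This is a quick way to confirm you haven't accidentally locked out a search engine while blocking the AI crawlers.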

Partial blocking

Maybe you want to let GPTBot access some pages but not others. That's supported:

User-agent: GPTBot
Disallow: /private/
Disallow: /premium/
Disallow: /api/
Allow: /blog/
Allow: /docs/

This blocks GPTBot from /private/, /premium/, and /api/ directories while allowing access to /blog/ and /docs/.
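Partial rules are easy to get subtly wrong, so it's worth testing each path. Here's a sketch using `urllib.robotparser` against the exact rules above (`example.com` is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# The partial-blocking rules from above, verbatim.
rules = """\
User-agent: GPTBot
Disallow: /private/
Disallow: /premium/
Disallow: /api/
Allow: /blog/
Allow: /docs/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Check what GPTBot may fetch under each directory.
for path in ("/private/report", "/blog/post", "/docs/intro", "/api/v1/users"):
    allowed = parser.can_fetch("GPTBot", f"https://example.com{path}")
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

If a path shows up as allowed when you expected it blocked, check for a missing trailing slash or an `Allow` rule that matches too early.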

Why would you do this? A few reasons:

  • You have paywalled content you want protected but free content you're fine with sharing
  • You want GPT to be able to answer questions about your public documentation
  • You're a news site and want AI to be able to cite your headlines

Should I block ChatGPT-User?

This is a judgment call. ChatGPT-User is used when someone in ChatGPT asks "summarize this article" with a link to your site. Blocking it means ChatGPT can't read that specific page for that specific user.

Some thoughts:

  • If you're worried about training data, ChatGPT-User browsing is supposedly not used for training
  • If you want ChatGPT to be able to cite your content when users ask about it, keep it unblocked
  • If you don't want AI interacting with your content at all, block it

Personally, I block both on content I want to protect, but leave ChatGPT-User unblocked on public documentation.

Server-level blocking

For extra insurance, you can also block GPTBot at the server level. robots.txt is only a voluntary convention, so server-level rules enforce the block against any requests that ignore it.

Apache (.htaccess)

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT-User [NC]
RewriteRule .* - [F,L]
</IfModule>
Want to skip the copy-paste?

Use our .htaccess generator to create these rules automatically.

Try .htaccess Generator

Nginx

if ($http_user_agent ~* "GPTBot|ChatGPT") {
    return 403;
}

Add this inside your server block.
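The nginx rule uses `~*`, a case-insensitive regex match. You can preview which User-Agent strings it would catch with an equivalent Python regex (the agent strings are illustrative samples):

```python
import re

# Same pattern as the nginx rule; re.IGNORECASE mirrors the ~* operator.
pattern = re.compile(r"GPTBot|ChatGPT", re.IGNORECASE)

agents = [
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)",
    "ChatGPT-User/1.0; +https://openai.com/bot",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]

for ua in agents:
    blocked = bool(pattern.search(ua))
    print(f"{'403' if blocked else 'pass'}  {ua}")
```

Note that `ChatGPT` also matches `ChatGPT-User`, so this single pattern covers both crawlers from the robots.txt section.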

Cloudflare

Create a custom WAF rule:

  1. Go to Security → WAF → Custom Rules
  2. Create a new rule
  3. Set: (http.user_agent contains "GPTBot") or (http.user_agent contains "ChatGPT")
  4. Action: Block

Verifying the block

Want to make sure it's working? A few methods:

curl test

curl -A "GPTBot/1.0" -I https://yoursite.com/

If your server-level block is working, you should see HTTP/1.1 403 Forbidden.

robots.txt test

For robots.txt verification, use the robots.txt report in Google Search Console (the old standalone robots.txt Tester has been retired), or use a third-party tool to validate the syntax.

Log monitoring

Check your access logs for GPTBot requests:

grep -i "gptbot" /var/log/apache2/access.log | tail -20

You should see 403 responses if your block is working.
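If you want more than a quick grep, a small script can tally GPTBot hits by status code. A sketch that parses combined-format log lines (the sample lines and IPs below are made up for illustration):

```python
import re
from collections import Counter

# Sample lines in Apache/nginx combined log format (illustrative data).
log_lines = [
    '203.0.113.7 - - [05/Dec/2024:10:01:22 +0000] "GET / HTTP/1.1" 403 199 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
    'GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.4 - - [05/Dec/2024:10:02:05 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"',
]

# The status code is the first 3-digit number after the quoted request.
status_re = re.compile(r'" (\d{3}) ')

counts = Counter()
for line in log_lines:
    if "gptbot" in line.lower():
        match = status_re.search(line)
        if match:
            counts[match.group(1)] += 1

print(dict(counts))
```

In a real deployment you'd read the lines from your access log file instead of a list; a healthy block shows only 403s for GPTBot.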

Common questions

Does blocking GPTBot hurt my SEO?

No. GPTBot is separate from Googlebot and Bingbot. ChatGPT's browsing and search features draw partly on Bing's index, but blocking GPTBot specifically has no effect on your search rankings.

Will my content be removed from ChatGPT?

No. Blocking GPTBot only prevents future crawling. If your content was crawled before you added the block, it may already be in their training data. OpenAI doesn't have a process for removing specific content from trained models.

Can I opt out of OpenAI entirely?

OpenAI has no formal opt-out process beyond robots.txt. Blocking their crawlers is about all you can do technically. For legal concerns, consult an attorney.

How often does GPTBot crawl?

This varies. High-traffic sites report seeing GPTBot daily. Smaller sites might see it weekly or less. Check your logs to see the pattern for your site.

My take

Honestly, GPTBot is one of the more "honest" crawlers. They identify themselves clearly, publish documentation about their crawler, and respect robots.txt. Compare that to some scrapers that disguise themselves as regular browsers.

That said, if you don't want your content in AI training data, blocking GPTBot is straightforward and effective. The robots.txt method works, and adding server-level blocking gives you belt-and-suspenders protection.

For the complete list of AI crawlers to block, see our AI Crawler List for 2025.

Found this helpful? Share it with others.


Ready to block AI crawlers?

Use our free generators to create your blocking rules in seconds.