How to Block Google-Extended (Without Hurting Your SEO)
This is probably the most misunderstood AI crawler. People block it thinking they're blocking Google, panic when they realize what they've done, then unblock it worrying about SEO. Let me clear this up.
Google-Extended is NOT Googlebot
This is the key point. Google runs two separate crawlers:
| Crawler | Purpose | Affects SEO |
|---------|---------|-------------|
| Googlebot | Search indexing | Yes |
| Google-Extended | AI training (Gemini, etc.) | No |
Blocking Google-Extended does not affect your search rankings. Your pages will still appear in Google Search. This crawler is specifically for training data collection for Google's AI products like Gemini.
I've seen site owners panic-post in forums about accidentally blocking Google. If you only blocked Google-Extended, you're fine. Take a breath.
The quick block
robots.txt:
User-agent: Google-Extended
Disallow: /
That's all you need. Keep Googlebot allowed:
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot
Allow: /
Use our robots.txt generator to create these rules automatically.
Try robots.txt Generator →
Full robots.txt example
Here's a complete example blocking AI crawlers while keeping search engines happy:
# Block AI training crawlers
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
# Allow search engine crawlers
User-agent: Googlebot
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Googlebot-News
Allow: /
User-agent: Bingbot
Allow: /
# Default allow for everything else
User-agent: *
Allow: /
The explicit Allow: / for Googlebot isn't strictly necessary, since Googlebot never matches the Google-Extended group, but I include it for clarity. It makes it obvious that we're intentionally keeping search indexing.
Server-level blocking
Google-Extended reliably respects robots.txt, so server-level blocking is optional. One caveat: per Google's crawler documentation, Google-Extended is a robots.txt control token rather than a crawler with its own user-agent string. Crawling happens under the regular Google user agents, so a user-agent match at the server level may never actually fire. If you still want a belt-and-suspenders rule:
Apache (.htaccess)
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC]
RewriteRule .* - [F,L]
</IfModule>
Be careful with this. Don't accidentally match "Google" without "Extended" or you'll block Googlebot too.
Nginx
if ($http_user_agent ~* "Google-Extended") {
return 403;
}
Again, be specific. Use "Google-Extended" not just "Google".
When blocking at server level, be very precise with your pattern. Matching "Google" alone will block Googlebot and tank your SEO.
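If you do add a server rule, you can sanity-check it by spoofing the user agent yourself with curl (same placeholder domain as above). The second command is the one that matters: it must keep returning 200, otherwise your pattern is catching Googlebot. And remember the caveat above: real Google crawling won't carry the Google-Extended token in its user agent, so the first command only proves the rule behaves as written.
# Spoofed token: expect 403 if your server rule matches it
curl -s -o /dev/null -w "%{http_code}\n" -A "Google-Extended" https://yoursite.com/
# Simplified stand-in for Googlebot's user agent: this must stay 200
curl -s -o /dev/null -w "%{http_code}\n" -A "Googlebot" https://yoursite.com/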
What Google-Extended crawls for
According to Google's documentation, Google-Extended crawls web content to train AI models like Gemini (formerly Bard). This includes:
- Gemini app
- Gemini API
- AI features in Google products
Notably, it's separate from:
- Search rankings (Googlebot)
- AI Overviews in Search (those use the existing Search index)
So even after blocking Google-Extended, your content might still appear in AI Overviews. Google uses its existing search index for that, not the Extended crawler.
The AI Overviews question
"But wait, if I block Google-Extended, will I still show up in AI Overviews?"
Probably yes. AI Overviews (the AI-generated summaries at the top of some search results) appear to use the regular search index. Blocking Google-Extended doesn't seem to affect them.
Google hasn't been super clear about this, which is frustrating. Their documentation suggests AI Overviews use Googlebot's index, not Google-Extended's. But they could change that.
If you want to opt out of AI Overviews specifically, Google has a separate mechanism involving the nosnippet or max-snippet directives, though these affect regular snippets too.
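For reference, here's roughly what those directives look like. They go in the page's head as a robots meta tag (an equivalent X-Robots-Tag HTTP header also works), and the values below are illustrative; keep in mind they limit regular search snippets too.
<!-- No text snippet at all for this page (also removes normal search snippets) -->
<meta name="robots" content="nosnippet">
<!-- Or cap snippet length instead; 0 behaves like nosnippet, -1 means no limit -->
<meta name="robots" content="max-snippet:50">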
Common mistakes
Mistake 1: Blocking "Google" broadly
# DON'T DO THIS
User-agent: Google
Disallow: /
This doesn't block Google-Extended specifically and might not even work as intended. Be explicit:
# DO THIS
User-agent: Google-Extended
Disallow: /
Mistake 2: Using wildcards incorrectly
# PROBLEMATIC
User-agent: *
Disallow: /ai-training/
This blocks all bots from that path, including search crawlers. If the goal is to keep only AI crawlers out (of that path or the whole site), name them explicitly:
# BETTER
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
Mistake 3: Forgetting robots.txt is cached
After updating your robots.txt, it can take a while for crawlers to pick up the change: Google generally caches the file for up to 24 hours, and other crawlers may cache it for days. If you need immediate blocking, use server-level rules alongside robots.txt.
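A CDN or caching layer in front of your site can add its own delay on top of that. Checking the response headers on the file you actually serve (placeholder domain again) shows how long intermediaries are being told to cache it:
# Look at the Cache-Control, Last-Modified, and Age headers on the served robots.txt
curl -sI https://yoursite.com/robots.txt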
Verifying it works
Check your robots.txt is accessible
curl https://yoursite.com/robots.txt
Make sure the Google-Extended rule is visible.
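To check the specific rule rather than eyeballing the whole file, pipe it through grep:
# Print the Google-Extended group plus the line after it
curl -s https://yoursite.com/robots.txt | grep -i -A 1 "Google-Extended"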
Google Search Console
Search Console's robots.txt report shows which robots.txt files Google has found, when each was last fetched, and any parse errors, so you can confirm Google has picked up the version containing your Google-Extended rule. (The old standalone robots.txt Tester has been retired.)
Watch your logs
grep -i "Google-Extended" /var/log/apache2/access.log
If requests carrying the token are getting 403s, your server-level block is working. If they're getting 200s, either the block isn't applied or the crawl went ahead despite robots.txt (unlikely for Google once the updated file has been picked up). Don't be surprised if the grep turns up nothing at all, though: Google-Extended crawls under the standard Google user agents, so the token may never appear in your logs, and robots.txt is where the control actually happens.
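Given that, a more useful log check is confirming that regular Googlebot is still being served normally after your changes. A minimal sketch, assuming the default combined log format (status code in the ninth field) and the Apache log path used above:
# Count response codes for Googlebot requests; you want to see mostly 200s
grep -i "Googlebot" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c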
My recommendation
For most sites, the robots.txt block is sufficient. Google respects it reliably. Adding server-level blocking is fine if you want extra assurance, but be careful with your regex to avoid blocking Googlebot.
If you're paranoid about AI training but want to stay in search results, blocking Google-Extended while keeping Googlebot is exactly the right approach. That's literally what Google created the separate crawler for.
Use our robots.txt generator to create these rules automatically.
Try robots.txt Generator →