
How to Block Bytespider (ByteDance's Aggressive Crawler)

November 26, 2024 (Updated: December 5, 2024) · 4 min read

Bytespider is ByteDance's web crawler—the company behind TikTok. And unlike most AI crawlers, it has a reputation for not playing nice.

Bytespider has been widely reported to ignore robots.txt. You NEED server-level blocking for this one.

The Bytespider problem

Most AI crawlers respect robots.txt. You add a Disallow rule, they stop coming. Bytespider? Not so much.

I've seen multiple reports—and experienced it myself—of Bytespider continuing to crawl sites well after being blocked in robots.txt. ByteDance claims they respect robots.txt, but server logs tell a different story.

Whether this is intentional or a bug in their system, the effect is the same: robots.txt alone won't stop them.

What Bytespider crawls for

Bytespider collects data for:

  • TikTok's recommendation algorithm
  • ByteDance's AI products
  • Douyin (TikTok's Chinese version)
  • Various machine learning initiatives

The User-Agent string contains (typically embedded in a longer Mozilla-compatible string):

Bytespider; spider-feedback@bytedance.com

They're pretty aggressive about crawl frequency too. I've seen Bytespider hit sites thousands of times in a single day.
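If you want to gauge that volume on your own site, here's a quick sketch. It runs against two sample lines in Apache's combined log format; in practice, point `LOG` at your real access log (e.g. /var/log/apache2/access.log):

```shell
# In practice: LOG=/var/log/apache2/access.log
# Two sample lines in the combined log format, for demonstration:
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [26/Nov/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
1.2.3.4 - - [27/Nov/2024:11:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
EOF

# Total Bytespider requests
grep -ic "bytespider" "$LOG"

# Requests per day ($4 is "[26/Nov/2024:10:00:00" in the combined format)
grep -i "bytespider" "$LOG" | awk '{print substr($4, 2, 11)}' | sort | uniq -c
```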

The solution: Server-level blocking

Since robots.txt isn't reliable, you need to block at the server level.

Apache (.htaccess)

<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
  RewriteCond %{HTTP_USER_AGENT} Bytedance [NC]
  RewriteRule .* - [F,L]
</IfModule>
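If mod_rewrite isn't available on your host, a sketch of an alternative using mod_setenvif with Apache 2.4's authorization directives (same effect, different modules):

```apache
# Alternative for Apache 2.4+ without mod_rewrite:
# tag ByteDance user agents, then deny them.
BrowserMatchNoCase "Bytespider|Bytedance" bad_bot
<RequireAll>
  Require all granted
  Require not env bad_bot
</RequireAll>
```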

I block both "Bytespider" and "Bytedance" in case they use alternate User-Agent strings.

Want to skip the copy-paste?

Use our .htaccess generator to create these rules automatically.

Try .htaccess Generator

Nginx

if ($http_user_agent ~* "Bytespider|Bytedance") {
    return 403;
}

Add this inside your server block.
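If you'd rather keep `if` usage to a minimum (Nginx's `if` has well-known pitfalls, though `if` + `return` is the one safe pattern), the same block can be expressed with a `map`. A sketch, assuming you can edit the `http` context:

```nginx
# http context: flag ByteDance user agents
map $http_user_agent $block_bytedance {
    default 0;
    "~*(Bytespider|Bytedance)" 1;
}

server {
    listen 80;

    if ($block_bytedance) {
        return 403;
    }
}
```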

Cloudflare

This is honestly the easiest option if you're using Cloudflare:

  1. Security → WAF → Custom Rules
  2. Create rule: (http.user_agent contains "Bytespider") or (http.user_agent contains "Bytedance")
  3. Action: Block

Cloudflare's Bot Fight Mode also helps catch Bytespider even when it tries to disguise itself.

Still add robots.txt

Even though Bytespider ignores robots.txt, add the rule anyway:

User-agent: Bytespider
Disallow: /

Why? Documentation. If you ever need to take legal action or report abusive behavior, having a clear robots.txt establishes your intent to block them. It shows you did everything the "right" way and they ignored it.

Want to skip the copy-paste?

Use our robots.txt generator to create these rules automatically.

Try robots.txt Generator

IP-based blocking

Some people go nuclear and block ByteDance's entire IP range. This is aggressive and can cause collateral damage, but it's an option if you're really fed up.

ByteDance uses various IP ranges, but they rotate frequently. A more maintainable approach is to use services like:

  • Cloudflare Bot Fight Mode
  • AWS WAF with managed rule sets
  • Your host's anti-bot features

These maintain updated blocklists so you don't have to.

Checking your logs

See if Bytespider is hitting your site:

grep -i "bytespider\|bytedance" /var/log/apache2/access.log | tail -50

Look at the response codes:

  • 200 = They're getting through
  • 403 = Your block is working

If you're seeing 200s after implementing blocks, double-check your .htaccess syntax and that mod_rewrite is enabled.
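To summarize which codes Bytespider is actually getting, you can tally field 9 of the combined log format. A sketch against sample lines (swap `LOG` for your real access log):

```shell
# In practice: LOG=/var/log/apache2/access.log
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [26/Nov/2024:10:00:00 +0000] "GET / HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
1.2.3.4 - - [26/Nov/2024:10:00:05 +0000] "GET /a HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
1.2.3.4 - - [26/Nov/2024:09:59:00 +0000] "GET /b HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com)"
EOF

# Count Bytespider requests per HTTP status code
# (field 9 is the status in the combined log format)
grep -i "bytespider" "$LOG" | awk '{print $9}' | sort | uniq -c | sort -rn
```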

The bigger picture

Bytespider's behavior is frustrating but not surprising. TikTok/ByteDance has a history of aggressive data collection. They're optimizing for their business, not web etiquette.

This is why I recommend a layered approach for all AI crawlers:

  1. robots.txt (catches well-behaved bots)
  2. Server-level blocking (catches most others)
  3. Cloudflare or similar WAF (catches the rest)

Bytespider is just the most obvious example of why you need all three layers.

Alternative: Rate limiting

If outright blocking is too aggressive, you can rate limit instead. One catch: Nginx doesn't allow limit_req inside an if block, so the limit key has to be built with a map in the http context:

map $http_user_agent $bytedance_ua {
    default "";
    "~*(Bytespider|Bytedance)" $http_user_agent;
}

limit_req_zone $bytedance_ua zone=bytedance:10m rate=1r/m;

Then, inside your server or location block:

limit_req zone=bytedance burst=5 nodelay;

Requests with an empty key aren't rate limited, so normal visitors are unaffected. This limits Bytespider to about one request per minute. They'll get your content eventually, but they won't hammer your server.

Verification

curl -A "Bytespider" -I https://yoursite.com/

Should return 403 Forbidden.

If it returns 200, check:

  • Is your .htaccess being processed? (AllowOverride setting)
  • Is mod_rewrite enabled?
  • Are there conflicting rules?

My recommendation

For Bytespider specifically:

  1. Add robots.txt block (for documentation)
  2. Add .htaccess or nginx block (for enforcement)
  3. Consider Cloudflare if you're seeing persistent crawling

Don't rely on robots.txt alone. Bytespider has demonstrated they don't reliably respect it.

Skip the manual work

Generate your blocking rules in seconds with our free tools.


Ready to block AI crawlers?

Use our free generators to create your blocking rules in seconds.