
How to Detect AI Crawlers in Your Server Logs

December 5, 2024 · 4 min read

Before you can block AI crawlers, it helps to know which ones are actually visiting. Your server logs tell the story. Here's how to read them.

Apache access logs

The default Apache log location is /var/log/apache2/access.log (Debian/Ubuntu) or /var/log/httpd/access_log (RHEL/CentOS).

Quick scan for all bots

grep -i "bot\|crawler\|spider" /var/log/apache2/access.log | head -100

This catches most crawlers since they typically include these words in their User-Agent.
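
For reference, a matching line in the default combined log format looks roughly like this (the IP, timestamp, and path are hypothetical):

203.0.113.50 - - [05/Dec/2024:10:15:30 +0000] "GET /blog/post HTTP/1.1" 200 14253 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"

Split on whitespace, the request path is field 7 and the status code is field 9, which is why the awk commands below reference $7 and $9.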

Find specific AI crawlers

grep -iE "GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider|PerplexityBot" /var/log/apache2/access.log

Count requests by crawler

grep -iE "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log | \
  awk '{print $12}' | sort | uniq -c | sort -rn

This shows how many requests each bot made.

See what they're crawling

grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

Shows the top 20 pages GPTBot requested.
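
Note that access.log only covers the current rotation period. To search rotated and gzipped logs as well (assuming logrotate's default naming of access.log.1, access.log.2.gz and so on), zgrep handles both plain and compressed files:

zgrep -hiE "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log*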

Nginx access logs

Default location: /var/log/nginx/access.log

Find AI crawlers

grep -iE "GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider" /var/log/nginx/access.log

Extract User-Agent field

Nginx logs put the User-Agent in different positions depending on your log format. Default combined format:

grep -i "GPTBot" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq

Count requests per day

grep -i "GPTBot" /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c

Shows request counts by date.

What to look for

Legitimate AI crawler User-Agents

These are the ones that identify themselves properly:

GPTBot/1.0 (+https://openai.com/gptbot)
ClaudeBot/1.0 (+https://anthropic.com/bot)
Google-Extended
CCBot/2.0 (+https://commoncrawl.org/faq/)
Bytespider
PerplexityBot/1.0

Red flags

  • High request volume in short time periods
  • Crawling the same pages repeatedly
  • Ignoring robots.txt (check if blocked pages are still being hit)
  • Requests for sensitive paths (/admin, /wp-login, etc.)
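
A quick way to investigate any of these red flags is to see which source IPs are behind the traffic. For example, listing the top IPs that claim to be GPTBot (Apache path shown; adjust for nginx) lets you reverse-DNS them or compare them against the operator's published address ranges to spot spoofed User-Agents:

grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head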

Checking if your blocks are working

If you've added robots.txt or .htaccess blocks, verify they're working:

Look for 403 responses

grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c

The $9 field is typically the HTTP status code. You should see mostly 403s if your block is working.
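
When the block is active, the output should be dominated by 403s, along these lines (counts are illustrative):

    152 403
      4 200

A few stray 200s are normal if they were logged before the block went live.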

Check for 200s that shouldn't be there

grep -i "Bytespider" /var/log/apache2/access.log | grep " 200 "

Bytespider is widely reported to ignore robots.txt, so if it's still getting 200 OK responses after you've disallowed it there, that's expected behavior rather than a misconfiguration; you need server-level blocking instead.
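
You can also test a server-level block without waiting for the next crawl by sending a request with the bot's User-Agent yourself (replace example.com with your own domain):

curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" https://example.com/

A 403 here confirms the User-Agent rule is firing; note that this only exercises User-Agent matching, not any IP-based rules.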

Automated monitoring

Simple bash script

Create a daily report:

#!/bin/bash
# ai-bot-report.sh - daily AI crawler summary from the Apache access log

LOG="/var/log/apache2/access.log"
BOTS="GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider|PerplexityBot"
TODAY=$(date +%d/%b/%Y)   # matches the date format used in combined-format logs

echo "=== AI Crawler Report $(date) ==="
echo ""
echo "Request counts by bot:"
echo "$BOTS" | tr '|' '\n' | while read -r bot; do
  printf '%8d  %s\n' "$(grep "$TODAY" "$LOG" | grep -ic "$bot")" "$bot"
done | sort -rn
echo ""
echo "Status code breakdown:"
grep "$TODAY" "$LOG" | grep -iE "$BOTS" | awk '{print $9}' | sort | uniq -c | sort -rn

Run it near the end of each day with cron. Scheduling it at midnight would report on the day that has only just started, so a late-evening slot works better:

55 23 * * * /path/to/ai-bot-report.sh >> /var/log/ai-bot-reports.log

GoAccess for visual analysis

If you want a prettier view, GoAccess is excellent:

sudo apt install goaccess  # Debian/Ubuntu

goaccess /var/log/apache2/access.log --log-format=COMBINED -o report.html

Open report.html in a browser. It shows a breakdown of all visitors including bots.
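
To focus the report on AI crawlers only, pre-filter the log and pipe it in (GoAccess reads from stdin when you pass - as the file argument):

grep -iE "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log | goaccess - --log-format=COMBINED -o ai-bots.html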

Shared hosting without log access

Many shared hosts don't give you raw log access. Options:

cPanel / Plesk logs

Check your hosting control panel for "Access Logs" or "Raw Log Access."

AWStats or similar

Many hosts have AWStats or Webalizer pre-installed. These show bot traffic in their reports.

Add your own logging

If you can run PHP or similar, add logging:

<?php
// log-bots.php - include at the top of your pages
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$bots = ['GPTBot', 'ClaudeBot', 'Bytespider'];

foreach ($bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        $log = date('Y-m-d H:i:s') . " | $bot | " . $_SERVER['REQUEST_URI'] . "\n";
        // FILE_APPEND | LOCK_EX prevents interleaved writes under concurrent requests;
        // ideally point this at a file outside your web root
        file_put_contents('bot-log.txt', $log, FILE_APPEND | LOCK_EX);
        break; // one match per request is enough
    }
}

Not as comprehensive as server logs, but better than nothing.

What the data tells you

Once you have visibility into AI crawler traffic, you can answer:

  1. Which bots are actually hitting your site? Maybe you've been worrying about GPTBot but Bytespider is your real problem.

  2. How often do they visit? Daily? Weekly? This affects how urgent blocking is.

  3. Are your blocks working? If you see 200s instead of 403s, something's wrong.

  4. What are they after? If they're hammering one section, you might only need to protect that.

Taking action

Once you've identified what's hitting your site, block it.

Skip the manual work

Generate your blocking rules in seconds with our free tools.

After blocking, re-check your logs to verify. You should see 403 Forbidden responses (for server-level blocking) or no requests at all (if they're respecting robots.txt).


Ready to block AI crawlers?

Use our free generators to create your blocking rules in seconds.