How to Detect AI Crawlers in Your Server Logs
Before you can block AI crawlers, it helps to know which ones are actually visiting. Your server logs tell the story. Here's how to read them.
Apache access logs
The default Apache log location is /var/log/apache2/access.log (Debian/Ubuntu) or /var/log/httpd/access_log (RHEL/CentOS).
Quick scan for all bots
grep -i "bot\|crawler\|spider" /var/log/apache2/access.log | head -100
This catches most crawlers since they typically include these words in their User-Agent.
Find specific AI crawlers
grep -iE "GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider|PerplexityBot" /var/log/apache2/access.log
Count requests by crawler
grep -iE "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log | \
awk '{print $12}' | sort | uniq -c | sort -rn
This shows how many requests each bot made.
See what they're crawling
grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
Shows the top 20 pages GPTBot requested.
Nginx access logs
Default location: /var/log/nginx/access.log
Find AI crawlers
grep -iE "GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider" /var/log/nginx/access.log
Extract User-Agent field
Nginx logs put the User-Agent in different positions depending on your log format. Default combined format:
grep -i "GPTBot" /var/log/nginx/access.log | awk -F'"' '{print $6}' | sort | uniq
Count requests per day
grep -i "GPTBot" /var/log/nginx/access.log | awk '{print $4}' | cut -d: -f1 | uniq -c
Shows request counts by date.
What to look for
Legitimate AI crawler User-Agents
These are the ones that identify themselves properly:
GPTBot/1.0 (+https://openai.com/gptbot)
ClaudeBot/1.0 (+https://anthropic.com/bot)
Google-Extended
CCBot/2.0 (+https://commoncrawl.org/faq/)
Bytespider
PerplexityBot/1.0
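To compare what's actually hitting your site against these documented strings, dump the distinct User-Agent values for a bot. A sketch for the Apache combined format, where the User-Agent is the sixth quote-delimited field:
grep -iE "GPTBot|ClaudeBot|CCBot" /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort -u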
Red flags
- High request volume in short time periods
- Crawling the same pages repeatedly
- Ignoring robots.txt (check if blocked pages are still being hit)
- Requests for sensitive paths (/admin, /wp-login, etc.)
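A quick way to sanity-check these red flags is to see which IP addresses a given bot string comes from; a spoofed User-Agent often shows up as one unfamiliar IP hammering the site. A rough sketch, assuming the Apache combined format where the client IP is the first field:
grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10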
Checking if your blocks are working
If you've added robots.txt or .htaccess blocks, verify they're working:
Look for 403 responses
grep -i "GPTBot" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c
The $9 field is typically the HTTP status code. You should see mostly 403s if your block is working.
Check for 200s that shouldn't be there
grep -i "Bytespider" /var/log/apache2/access.log | grep " 200 "
If Bytespider is still getting 200 OK responses after you've blocked it, your robots.txt isn't being respected (which is expected, since Bytespider is widely reported to ignore it) and you need server-level blocking.
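As a minimal sketch of what server-level blocking can look like on Apache (assuming mod_rewrite is enabled; adjust the bot list to suit), an .htaccess rule along these lines returns 403 to matching User-Agents:
# Return 403 Forbidden to selected AI crawlers (requires mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider) [NC]
RewriteRule .* - [F,L]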
Automated monitoring
Simple bash script
Create a daily report:
#!/bin/bash
# ai-bot-report.sh
LOG="/var/log/apache2/access.log"
BOTS="GPTBot|ClaudeBot|Anthropic|Google-Extended|CCBot|Bytespider|PerplexityBot"
echo "=== AI Crawler Report $(date) ==="
echo ""
echo "Request counts by bot:"
grep -iE "$BOTS" "$LOG" | grep "$(date +%d/%b/%Y)" | \
grep -oiE "$BOTS" | sort | uniq -c | sort -rn
echo ""
echo "Status code breakdown:"
grep -iE "$BOTS" "$LOG" | grep "$(date +%d/%b/%Y)" | \
awk '{print $9}' | sort | uniq -c | sort -rn
Run with cron just before midnight, so the date filter still matches the day being reported:
59 23 * * * /path/to/ai-bot-report.sh >> /var/log/ai-bot-reports.log
GoAccess for visual analysis
If you want a prettier view, GoAccess is excellent:
sudo apt install goaccess # Debian/Ubuntu
goaccess /var/log/apache2/access.log --log-format=COMBINED -o report.html
Open report.html in a browser. It shows a breakdown of all visitors including bots.
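GoAccess also reads from a pipe, so one option is to pre-filter the log down to AI crawler traffic and build a separate report from just that (bot list as above):
grep -iE "GPTBot|ClaudeBot|Bytespider|PerplexityBot" /var/log/apache2/access.log | \
goaccess - --log-format=COMBINED -o ai-bot-report.html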
Shared hosting without log access
Many shared hosts don't give you raw log access. Options:
cPanel / Plesk logs
Check your hosting control panel for "Access Logs" or "Raw Log Access."
AWStats or similar
Many hosts have AWStats or Webalizer pre-installed. These show bot traffic in their reports.
Add your own logging
If you can run PHP or similar, add logging:
<?php
// log-bots.php - include at the top of pages
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$bots = ['GPTBot', 'ClaudeBot', 'Bytespider'];
foreach ($bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        $log = date('Y-m-d H:i:s') . " | $bot | " . $_SERVER['REQUEST_URI'] . "\n";
        // Append with an exclusive lock so concurrent requests don't interleave lines
        file_put_contents('bot-log.txt', $log, FILE_APPEND | LOCK_EX);
        break; // one match per request is enough
    }
}
Not as comprehensive as server logs, but better than nothing.
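Because each line of bot-log.txt follows the date | bot | path layout written above, a one-liner can summarize it, for example counting hits per bot:
cut -d'|' -f2 bot-log.txt | sort | uniq -c | sort -rn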
What the data tells you
Once you have visibility into AI crawler traffic, you can answer:
- Which bots are actually hitting your site? Maybe you've been worrying about GPTBot but Bytespider is your real problem.
- How often do they visit? Daily? Weekly? This affects how urgent blocking is.
- Are your blocks working? If you see 200s instead of 403s, something's wrong.
- What are they after? If they're hammering one section, you might only need to protect that.
Taking action
Once you've identified what's hitting your site, block it:
Generate your blocking rules in seconds with our free tools.
After blocking, re-check your logs to verify. You should see 403 Forbidden responses (for server-level blocking) or no requests at all (if they're respecting robots.txt).
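One way to confirm is to re-run the status-code check from earlier and make sure nothing is still coming back 200 (paths as in the Apache examples above):
grep -iE "GPTBot|ClaudeBot|Bytespider" /var/log/apache2/access.log | awk '{print $9}' | sort | uniq -c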