Legal Implications of AI Scraping - What Website Owners Should Know
AI companies are scraping the web at an unprecedented scale to train their models. Your content—blog posts, images, code, documentation—might already be in several training datasets. And website owners are understandably asking: Is this legal? Can I stop it? Do I have any recourse?
The honest answer is that the legal landscape is still being determined. Courts are actively working through these questions, and the outcomes will shape the internet for decades. Here's where things stand.
This article provides general information about AI scraping and copyright issues. It is not legal advice. For guidance specific to your situation, consult a qualified attorney.
The current legal landscape
The core question is whether using copyrighted content to train AI models constitutes fair use (in the US) or falls under similar exceptions elsewhere. There's no clear precedent yet, which is why so many lawsuits are currently working through the courts.
What we do know:
- Traditional scraping for search engines has generally been treated as fair use (see, e.g., Field v. Google and Perfect 10 v. Amazon).
- AI training is different: plaintiffs argue that models and their outputs are derivative works, and the companies profit from them at enormous scale.
- No law explicitly addresses AI training, leaving courts to apply existing copyright frameworks to novel technology.
The outcome matters enormously. If courts rule that AI training is fair use, companies can continue scraping everything. If not, they'll need licenses or face massive damages.
Key lawsuits and their implications
Several high-profile cases are testing these questions. Here are the ones to watch:
The New York Times vs. OpenAI and Microsoft
The Times filed suit alleging that OpenAI and Microsoft illegally used millions of their articles to train GPT models. Their argument centers on:
- ChatGPT can reproduce Times articles nearly verbatim
- OpenAI bypassed paywalls during training
- The AI competes directly with their journalism
- No license was obtained for the training data
This case is significant because The New York Times has the resources to litigate it fully and is a sympathetic plaintiff: a journalism institution against two of the world's best-funded technology companies.
Getty Images vs. Stability AI
Getty sued Stability AI (maker of Stable Diffusion), alleging it trained on more than 12 million Getty images without authorization. Getty's evidence includes AI-generated images containing distorted fragments of its watermark, strong circumstantial proof of where the training data came from.
This case is important for visual content creators. Fair use arguments tend to be more nuanced for text; outputs that visibly reproduce portions of copyrighted images are harder to defend.
Sarah Silverman and Authors vs. Meta and OpenAI
Multiple authors, including Sarah Silverman, have sued AI companies for training on their books. These books were allegedly obtained from "shadow libraries" (piracy sites), adding another legal dimension—not just unauthorized use, but use of pirated materials.
Programmers vs. GitHub Copilot
A class action challenges GitHub Copilot for training on open source code without attribution. The argument is that even if code is publicly available, licenses often require attribution that Copilot doesn't provide.
Copyright considerations
Copyright law protects original creative works. Here's how it applies to AI scraping:
What's protected
- Written content (articles, books, blog posts)
- Images and photographs
- Software code
- Music and audio
- Videos
All of these are being scraped for AI training.
The fair use defense
AI companies primarily argue fair use, which considers:
- Purpose and character — Is the use transformative? Commercial?
- Nature of the copyrighted work — Is it creative or factual?
- Amount used — How much of the work was taken?
- Market effect — Does it harm the market for the original?
For AI training:
- It's commercial use at massive scale (weighs against fair use)
- The works are creative (weighs against fair use)
- Entire works are ingested (weighs against fair use)
- AI-generated content competes with human creators (weighs against fair use)
On the other hand:
- Training arguably creates something transformative (a new capability, not just a copy)
- Individual snippets in outputs may be small
- The training data isn't republished directly
Courts will have to balance these factors. Reasonable experts disagree on the outcome.
The robots.txt question
Here's something that surprises many website owners: robots.txt has essentially no legal weight.
robots.txt is a voluntary standard. It's a request, not an access control. Search engines respect it because it's in everyone's interest, but:
- There's no law requiring compliance
- Ignoring robots.txt isn't illegal by itself
- It doesn't establish any legal protection for your content
That said, some legal theories do incorporate it:
- Terms of Service violations — If your ToS prohibits scraping, robots.txt can serve as notice
- Trespass to chattels — Ignoring robots.txt while overloading servers might contribute to this claim
- Contract claims — Some courts view robots.txt as part of implied terms of access
Bottom line: Keep using robots.txt to signal your preferences, but don't rely on it as legal protection.
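For reference, a minimal robots.txt that asks the major AI training crawlers to stay away might look like this. The user-agent tokens shown are the ones the vendors publish as of this writing; verify them against each company's documentation, since names change and new crawlers appear regularly.

```
# Ask known AI training crawlers to skip the entire site.
# Tokens current as of this writing; check vendor docs.

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that Google-Extended never appears in your access logs as its own crawler; it's a token Google's regular crawlers consult to decide whether your pages may be used for AI training, so blocking it doesn't affect your search ranking.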
GDPR and data protection
If your site handles user data and you're subject to GDPR (or similar regulations), there's another angle. AI training on personal data may violate data protection laws:
- Training data might include personal information from comments, reviews, or forums
- Users didn't consent to their data being used for AI training
- The "legitimate interest" basis is questionable for this use
Italy temporarily banned ChatGPT over these concerns. Other regulators are investigating. This doesn't help with scraping of your copyrighted content, but it adds pressure on AI companies to be more careful about what they ingest.
What website owners can actually do
Given the legal uncertainty, here's what's practical:
Technical measures
You can't prevent scraping entirely, but you can make it harder:
- robots.txt — Use our robots.txt Generator to block known AI crawlers (GPTBot, ClaudeBot, CCBot, etc.)
- Server-level blocking — Block by user agent with .htaccess or nginx config (see the sketch after this list)
- Rate limiting — Slow down aggressive scrapers
- CAPTCHA or bot detection — For particularly valuable content
These won't stop determined scrapers, but they document that you didn't consent and make casual scraping harder.
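As a sketch of the server-level approach mentioned above, an Apache .htaccess rule can refuse these bots outright rather than merely asking. The user-agent list here is illustrative; match it to what actually shows up in your logs.

```
# Return 403 Forbidden to known AI training crawlers.
# User-agent strings change over time; keep this list in sync
# with vendor documentation and your own access logs.
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider) [NC]
  RewriteRule ^ - [F]
</IfModule>
```

Unlike robots.txt, this actually denies the request, but a scraper can evade it by changing its user-agent string. Treat it as documentation of non-consent plus a speed bump, not a wall.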
Surveys of the most-visited websites suggest over a third now block GPTBot, and the number is growing. It's becoming standard practice to explicitly deny access to AI crawlers.
Legal measures
Consider these steps:
Update your Terms of Service:
Add explicit language prohibiting scraping for AI training. Example language (not legal advice—have a lawyer review):
"Automated collection of content from this website for the purpose of training machine learning or artificial intelligence systems is prohibited without express written permission."
Add copyright notices:
Make your copyright claims explicit. While not required for protection, it removes any "innocent infringement" defense.
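A notice can be as simple as a footer on every page; the site name and year below are placeholders:

```
<footer>
  <!-- Placeholder name and year; adapt to your site. -->
  <p>&copy; 2025 Example Site. All rights reserved. Automated collection
     of this content for AI training is prohibited without written permission.</p>
</footer>
```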
Register copyrights:
In the US, registering your works with the Copyright Office is a prerequisite to filing an infringement suit, and registering before the infringement occurs (or within three months of publication) is what makes statutory damages and attorney's fees available.
Consider licensing:
Some AI companies are now striking licensing deals with publishers; OpenAI, for example, has signed agreements with the Associated Press and Axel Springer. If your content has significant value, this might be worth exploring.
Collective action
Individual website owners have limited leverage. But:
- Trade associations are forming around these issues
- Class action lawsuits may be an option
- Industry-wide standards are emerging (like news organizations collectively blocking AI crawlers)
The outcomes of major lawsuits will benefit or harm all content creators, not just the plaintiffs.
The future of AI content licensing
The current situation isn't sustainable. It will resolve in one of three ways:
- Courts rule training is fair use — AI companies continue as is, creators get nothing
- Courts rule against fair use — AI companies must license or face damages
- Legislation creates a framework — Compulsory licensing, opt-out registries, etc.
Some predictions:
- We'll likely see different outcomes for different content types
- Major publishers will get licensing deals; small creators probably won't
- Opt-out mechanisms will become more formal
- AI companies will develop "cleaner" training data pipelines
What I'd recommend
If you're a website owner concerned about AI scraping:
- Block AI crawlers now — It documents non-consent and costs nothing
- Update your ToS — Explicitly prohibit AI training use
- Follow the lawsuits — The outcomes will determine your options
- Don't panic — Your content still has value; this is playing out in slow motion
- Consider your leverage — If you have significant unique content, explore licensing
And most importantly: consult an attorney if you're considering legal action or have substantial content at stake. The specifics matter enormously in IP law.
The bigger picture
We're in a pivotal moment for creator rights online. The decisions made in courts over the next few years will shape whether the internet remains a place where content creation is economically viable.
The tools exist to signal your preferences to AI companies—use them. The legal questions are genuinely unsettled—watch how they develop. And if your content has been used without permission, you're not alone—millions of creators are in the same position.
Our robots.txt Generator and .htaccess Generator can help you implement technical blocks against AI crawlers. It takes five minutes and makes your position explicit.