Most Websites Are Training AI Models for Free, and Few Have a Plan to Stop It

March 21, 2026

Every time you publish a blog post, product page, or tutorial, there's a decent chance an AI crawler will scrape it. OpenAI, Google, Anthropic, Meta, and dozens of smaller players routinely scrape the open web to feed their training pipelines. The content creators who wrote that material? They get nothing.

This isn't a new problem, but the scale has changed. AI training datasets now run into the trillions of tokens. Common Crawl, the open dataset that many models use as a starting point, contains petabytes of web data pulled from billions of URLs. Your company's help docs, your Medium posts, your carefully researched buying guides: they may well all be in there, blended into model weights that generate billions in revenue.

The robots.txt Fiction

The standard defense is robots.txt, the decades-old text file that tells crawlers which pages to skip. In practice, it's a suggestion, not a wall: compliance is voluntary, and nothing in the protocol enforces it. Several AI companies have been reported to ignore robots.txt directives entirely. Others use crawlers with user-agent strings that don't match any published opt-out list. Even when companies do respect robots.txt, opting out can also remove your content from AI-powered search results, which increasingly means disappearing from the internet.
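To make the "suggestion, not a wall" point concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt, using Python's standard urllib.robotparser module. The robots.txt contents, the is_allowed helper, and the example.com URL are illustrative assumptions, not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out of one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that identifies itself with a listed token is told to stay out.
print(is_allowed("GPTBot", "https://example.com/post"))

# A crawler with an unlisted token falls through to the wildcard group.
print(is_allowed("UnlistedBot/1.0", "https://example.com/post"))
```

Note that the check runs inside the crawler's own code. A crawler that never calls something like is_allowed, or that announces itself under a token you did not list, is not stopped by anything in the file.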

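Google's AI Overviews now appear on roughly 30% of search queries. Google's training opt-out, the Google-Extended token, does not remove your pages from those summaries; the controls that do, such as nosnippet rules or blocking Googlebot, also cut into your ordinary search visibility. It's a lose-lose: let them use your content for free, or become invisible.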

Who's Actually Getting Paid

A handful of large publishers have cut licensing deals. The Associated Press signed with OpenAI. News Corp's deal with OpenAI is reportedly worth more than $250 million over five years. Reddit licensed its data to Google for a reported $60 million annually. But these deals go to organizations with legal teams and negotiating power. The independent blogger, the niche SaaS company writing product comparisons, the freelance journalist: none of them have a seat at that table.

Some startups are trying to fill the gap. Services that watermark content, track AI usage, or negotiate collective licensing deals are starting to appear. None have reached meaningful scale yet. The fundamental problem is that once your text is in a training dataset, there's no practical way to prove it influenced any specific output or to claw back compensation after the fact.

What You Can Actually Do Today

The honest answer is: not much, at least not without trade-offs. You can update robots.txt to block known AI crawlers, or add an ai.txt file, a proposed convention that crawlers are under no obligation to honor. You can put content behind authentication walls, though that kills organic traffic. You can join class-action lawsuits, several of which are working through US courts right now, but don't expect a check anytime soon.
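As a rough sketch of the robots.txt option, the snippet below generates one opt-out group per AI crawler token. The token list is an assumption for illustration (CCBot, Common Crawl's crawler, is added here and is not named in the article), and build_robots_txt is a hypothetical helper; check each vendor's documentation for its current token before relying on it.

```python
# Illustrative list only; vendors add and rename tokens over time.
AI_CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_robots_txt(tokens=AI_CRAWLER_TOKENS, disallow_path="/"):
    """Build robots.txt text that disallows each token and allows all others."""
    groups = [f"User-agent: {token}\nDisallow: {disallow_path}" for token in tokens]
    # Leave ordinary crawlers (including regular search indexing) untouched.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(build_robots_txt())
```

The output would be served at /robots.txt on your site. It only affects crawlers that choose to read and honor it.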

The more interesting question is whether this dynamic is sustainable. AI companies need fresh, high-quality web content to keep their models current. If creators stop publishing openly because there's no financial incentive, the training data pipeline dries up. We're not there yet, but the tension is real and growing.

For now, if you run a content-heavy site, at minimum audit which AI crawlers are hitting your pages. Check your server logs for GPTBot, ClaudeBot, CCBot, and similar user agents. Google-Extended is a different case: it is a robots.txt control token rather than a separate crawler, so it will not show up in your logs. Knowing the scope of the problem is step one. Getting paid for it is still step zero.
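A log audit along these lines can be sketched in a few lines of Python. This assumes access logs in the common "combined" format, where the user agent is the last double-quoted field; the agent list and the count_ai_crawler_hits helper are illustrative, and the sample lines are made up. Google-Extended is left out because, per Google's documentation, it is a robots.txt token and not a request user agent.

```python
import re
from collections import Counter

# Illustrative substrings to look for; extend with whatever you see in your logs.
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "CCBot"]

# In combined log format the user agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines, agents=AI_USER_AGENTS):
    """Count requests per AI crawler by matching substrings in the user agent."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for agent in agents:
            if agent.lower() in user_agent:
                hits[agent] += 1
                break
    return hits


# Made-up sample lines for demonstration.
sample = [
    '203.0.113.5 - - [21/Mar/2026:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [21/Mar/2026:10:00:05 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(count_ai_crawler_hits(sample))
```

To run it against a real file, pass an open file handle, for example count_ai_crawler_hits(open("access.log")), with the path adjusted to your server. User-agent strings can be spoofed, so treat the counts as a lower bound on crawler activity, not a full picture.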
