
Study of 1M Domains Finds 90% Have Zero AI Agent Permissions Policy

Nine out of ten websites have no machine-readable policy telling AI agents what they can and cannot do. That's the headline finding from Maango's State of AI Agent Policies 2026 report, which crawled 999,316 domains from the Tranco Top 1M list during February 2026.

The timing matters. AI agents are no longer just scraping web pages for training data. They're browsing, summarizing, purchasing, and taking actions on behalf of users. And the vast majority of the web has said nothing - in any machine-readable format - about whether that's allowed.

Eight Standards, No Consensus

Part of the problem is fragmentation. There are currently eight competing ways for a website to communicate AI permissions: robots.txt AI directives, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, Markdown for Agents, meta tags, and Terms of Service language. Only 2.6% of domains use more than one.

Of the newer standards, llms.txt leads deliberate adoption at 3.24% of domains. Cloudflare Content Signals is nominally higher at 3.48%, but that figure is inflated by platform defaults rather than deliberate choices. TDMRep, built specifically for EU Copyright Directive compliance, has been adopted by just 0.004% of domains, or roughly 40 of the 999,316 crawled.

The report's blunt assessment: "Eight competing standards is not a governance framework. It's a fragmentation problem."

Who's Blocking and Why

Among domains that do have policies, GPTBot is the most-blocked AI crawler at 6.9%, followed by ClaudeBot at 6.1%, Amazonbot at 6.0%, and Google-Extended at 5.9%. About 58,800 domains block both GPTBot and ClaudeBot simultaneously, suggesting most site owners are doing blanket AI blocking rather than making targeted decisions about specific agents.
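To make the blanket-versus-targeted distinction concrete, here is a minimal sketch of how one might check which of the four crawlers named above a robots.txt file fully blocks. It uses Python's standard `urllib.robotparser`. The `blocked_agents` and `blocking_style` helpers, the sample file, and the `example.com` URL are illustrative assumptions, not the report's methodology.

```python
from urllib.robotparser import RobotFileParser

# The four AI crawlers named in the report's blocking figures.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Amazonbot", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI agents that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


def blocking_style(robots_txt: str) -> str:
    """Hypothetical labels: 'blanket' if every listed agent is blocked,
    'targeted' if only some are, 'none' otherwise."""
    blocked = blocked_agents(robots_txt)
    if not blocked:
        return "none"
    return "blanket" if len(blocked) == len(AI_AGENTS) else "targeted"


# Illustrative robots.txt: two AI crawlers blocked, everyone else allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE))
print(blocking_style(SAMPLE))
```

A real survey would also need to handle missing files, redirects, and path-level rules, which this sketch ignores.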

The most interesting pattern is infrastructure-driven. Cloudflare-hosted sites block AI agents at 11.3%, more than double the baseline, because Cloudflare offers a one-click blocking toggle. Vercel sites block at just 1.3%, Netlify at 0.7%. When blocking is easy, people block. When it requires manual configuration, they mostly don't bother.

Top-1,000 sites are 1.8x more likely to block AI agents than average, with ESPN, CNBC, LA Times, and BBC among the most restrictive. Meanwhile Google, YouTube, Microsoft, Apple, and Wikipedia have no AI-specific machine-readable policies at all.

The Terms of Service Trap

The most alarming finding involves what the report calls the ToS disconnect. Of 79,173 domains with discoverable Terms of Service pages, 7,481 prohibit crawling and 1,377 specifically prohibit AI training. But 85.5% of these domains have no AI-specific robots.txt rules backing up those legal restrictions.

That creates a trap for AI agent developers. An agent that checks robots.txt - the standard technical mechanism - sees no restriction. But the site's legal terms explicitly prohibit the activity. YouTube, Discord, Target, and Substack all fall into this gap.
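The gap described above can be sketched as a two-signal check: does robots.txt let AI crawlers through, and does the ToS text appear to restrict automated access? The phrase list below is a deliberately naive assumption for illustration; real ToS language varies widely and a keyword match is not legal analysis. The function names are hypothetical as well.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ("GPTBot", "ClaudeBot", "Amazonbot", "Google-Extended")

# Hypothetical, naive phrase list. A real classifier would need far more care.
TOS_PHRASES = ("crawl", "scrap", "spider", "automated means", "train")


def robots_allows_ai(robots_txt: str, url: str = "https://example.com/") -> bool:
    """True if robots.txt lets every listed AI agent fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, url) for agent in AI_AGENTS)


def tos_mentions_restriction(tos_text: str) -> bool:
    """True if the ToS text contains any of the (assumed) restriction phrases."""
    text = tos_text.lower()
    return any(phrase in text for phrase in TOS_PHRASES)


def has_tos_gap(robots_txt: str, tos_text: str) -> bool:
    """Flag sites whose ToS seems restrictive while robots.txt is silent."""
    return robots_allows_ai(robots_txt) and tos_mentions_restriction(tos_text)


open_robots = "User-agent: *\nAllow: /\n"
tos = "You may not use automated means to crawl or scrape the Service."
print(has_tos_gap(open_robots, tos))
```

An agent operator could treat a positive result as a prompt for human review rather than as a definitive answer either way.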

Another 6,317 domains contain outright contradictions between their different policy signals, like blocking GPTBot in robots.txt while simultaneously setting "search=yes" in Content Signals.
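A contradiction of that kind could be detected along the following lines. The sketch assumes Content Signals appear in robots.txt as `Content-Signal: name=value, ...` lines, which is how Cloudflare documents the format, and it ignores which user-agent group a signal belongs to. The helper names are hypothetical, and treating any `yes` signal alongside a full GPTBot block as a conflict simply follows the report's example.

```python
from urllib.robotparser import RobotFileParser


def parse_content_signals(robots_txt: str) -> dict[str, str]:
    """Collect name=value pairs from Content-Signal lines (assumed syntax)."""
    signals: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            if eq:
                signals[name.strip().lower()] = setting.strip().lower()
    return signals


def find_contradictions(robots_txt: str, agent: str = "GPTBot",
                        url: str = "https://example.com/") -> list[str]:
    """Return signals set to 'yes' even though `agent` is blocked from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(agent, url):
        return []  # agent is not blocked, so nothing to contradict
    signals = parse_content_signals(robots_txt)
    return sorted(name for name, setting in signals.items() if setting == "yes")


# Illustrative file mirroring the report's example.
MIXED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""

print(find_contradictions(MIXED))
```

Because unrecognized lines are skipped by `urllib.robotparser`, the `Content-Signal` line does not affect the allow/deny result here.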

Geographically, the UK has the highest blocking rate among major markets at 4.4%, while Japan (1.1%) and Iran (0.9%) are the most permissive. European domains trend more restrictive, likely reflecting GDPR culture and the EU Copyright Directive.

What This Means for AI Tool Users

For anyone building or using AI agents that interact with the web, this report reveals a governance vacuum. Most websites haven't said yes or no to AI agents, in part because the tooling to express preferences is fragmented and confusing. Blocking rates are 2.3x higher on Cloudflare, where blocking takes one click, than at baseline, which strongly suggests that friction, not intent, is the primary driver of non-adoption.

Until the industry consolidates around fewer standards and makes policy-setting as easy as flipping a switch, the web will remain a legal gray zone for AI agents. And the thousands of domains whose ToS restrict crawling or AI training without matching robots.txt rules represent real liability for agent operators who assume robots.txt tells the whole story.