The New York Times Is Blocking the Internet Archive, and It Won't Slow AI One Bit

The New York Times has started blocking the Internet Archive from crawling its website. The Guardian appears to be doing the same. The stated reason: fears about AI companies scraping their content. But according to the Electronic Frontier Foundation, these publishers are punishing the wrong target and destroying something irreplaceable in the process.

The Internet Archive's Wayback Machine holds over one trillion archived web pages built up over nearly thirty years. Wikipedia links to more than 2.6 million news articles preserved there across 249 languages. When a news organization edits a story, retracts a claim, or quietly deletes an article, the Archive is often the only place that original version still exists.

None of that has anything to do with training large language models.

The Archive Isn't Building AI Products

The EFF's Joe Mullin makes a simple point that publishers seem to be ignoring: the Internet Archive is a nonprofit. It doesn't train commercial AI systems. It doesn't sell data to companies that do. It preserves web pages so the public can access them later, the same way a library preserves books.

Publishers have legitimate concerns about AI companies using their journalism without permission or payment. But blocking a nonprofit archivist does nothing to address that problem. OpenAI, Google, and Anthropic have their own crawlers, their own data pipelines, and their own legal teams. They don't need the Wayback Machine.

What blocking the Archive actually does is ensure that when the Times changes a headline, removes a paragraph, or takes down a story entirely, there's no independent record of what was originally published. That's a loss for researchers, journalists, historians, and anyone who thinks accountability matters.

Courts Have Already Weighed In on This

The legal landscape here isn't as murky as publishers suggest. U.S. courts have consistently found that making material searchable qualifies as fair use. The landmark Google Books case, Authors Guild v. Google, established that copying works to create searchable databases is "clear fair use" because it serves a transformative purpose.

The EFF argues that the Archive's crawling falls squarely into this category. It's not republishing articles to compete with the Times. It's preserving snapshots of the web as it existed at a specific moment, a function closer to a library catalog than a content farm.

The AI copyright question is genuinely unresolved. Whether training models on copyrighted text counts as fair use will likely take years of litigation to settle. But conflating that fight with web archiving is either confused thinking or a convenient excuse to exert broader control over how published content gets used after the fact.

What Actually Gets Lost

This isn't abstract. When a major publisher blocks the Archive, specific things disappear:

  • Original versions of stories that were later corrected or updated with no public record of what changed
  • Articles that get removed entirely during legal disputes or editorial reversals
  • The ability for researchers to study how news coverage of events evolved over time
  • Citation links from Wikipedia articles that suddenly point nowhere useful

The web is already fragile. Studies regularly find that a significant percentage of links from even five years ago are dead. The Internet Archive is one of the few organizations actively fighting that decay, and it's been doing it since 1996.

Publishers who want to fight AI scraping have real options available: licensing deals, lawsuits against specific AI companies, technical measures targeted at known AI crawlers. Blocking a nonprofit archivist isn't strategy. It's collateral damage dressed up as copyright enforcement.
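
For a sense of what "targeted" could look like in practice, here is a minimal sketch in Python. It is an illustration, not any publisher's actual configuration. It assumes the user-agent tokens the AI companies have published for their crawlers (GPTBot, ClaudeBot, Google-Extended) and assumes the Internet Archive's crawler identifies itself as archive.org_bot. The sample URL and the helper name `allowed` are made up for the example. Keep in mind that robots.txt is advisory: it only works against crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse known AI-training crawlers, allow everyone else.
# The user-agent tokens below are assumptions based on what the vendors publish.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(agent: str, url: str = "https://example.com/2024/01/story.html") -> bool:
    """Return True if `agent` may fetch `url` under the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "archive.org_bot" is assumed to be the Internet Archive's crawler name.
    for agent in ("GPTBot", "ClaudeBot", "Google-Extended", "archive.org_bot"):
        print(f"{agent}: {'allowed' if allowed(agent) else 'blocked'}")
```

Under these rules the three AI crawlers are refused while an archiving crawler falls through to the catch-all entry and stays allowed, which is the distinction the EFF says publishers are failing to make.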