37,000 AI-generated fake podcasts are now publicly documented. Listen Notes, the company behind a podcast search engine that indexes over 4 million shows, published the dataset on Kaggle this week, making it the largest labeled collection of synthetic podcast spam available to researchers and platform developers.
What the Dataset Actually Contains
These aren't experimental shows or AI-assisted legitimate content. They're built to take up space in podcast directories: synthetic narration, keyword-stuffed titles, auto-generated descriptions, and no human audience behind them. Listen Notes identified them through audio pattern analysis and metadata signals, then packaged the episode audio files, RSS feed data, and show metadata into a single research benchmark.
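To make "metadata signals" concrete, here is a minimal Python sketch of the kind of features a researcher might compute from show metadata before training a classifier. It is illustrative only: Listen Notes has not described its signals here, and the field names `title` and `description`, the helper names, and the chosen features are all assumptions rather than the dataset's actual schema.

```python
import re
from collections import Counter


def keyword_stuffing_score(title: str) -> float:
    """Return the fraction of title words that repeat an earlier word.

    0.0 means every word is unique; higher values suggest keyword stuffing.
    This is a hypothetical heuristic, not a documented Listen Notes signal.
    """
    words = re.findall(r"[a-z0-9']+", title.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(words)


def metadata_signals(show: dict) -> dict:
    """Compute a few simple features from a show record.

    Assumes `show` has optional "title" and "description" string fields
    (assumed names, not the dataset's real schema).
    """
    title = show.get("title", "")
    description = show.get("description", "")
    return {
        "title_word_count": len(title.split()),
        "title_repeat_ratio": keyword_stuffing_score(title),
        "title_separator_count": sum(title.count(sep) for sep in ("|", ",", "/")),
        "description_length": len(description),
    }


if __name__ == "__main__":
    example = {
        "title": "Best Sleep Music | Sleep Sounds | Sleep Meditation Sleep",
        "description": "Sleep music for sleep.",
    }
    print(metadata_signals(example))
```

Features like these would not settle anything on their own. In practice they would be combined with audio-level features and evaluated against labeled examples before anyone trusted them.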
The conditions for podcast spam have existed for years. Hosting is cheap or free, directory submission requires no verification, and text-to-speech synthesis has improved to the point where synthetic voices pass casual listening tests. The same dynamics that created email spam and social media bot farms are playing out in podcast directories, just more slowly because the economics took longer to tip.
Platforms Haven't Caught Up
Apple Podcasts, Spotify, and the other major directories have invested very little in content moderation compared to video or social platforms. Spotify has added synthetic music detection for some track types, but fake spoken-audio podcasts sit in a different category and largely slip through. The podcast ecosystem's open RSS architecture, one of its genuine strengths for independent creators, also makes enforcement difficult. Anyone can create a valid feed and submit it.
For real podcast creators, this shows up as noise rather than crisis: more irrelevant competition in search results, murkier download data, and directory recommendations that include obvious filler. For advertisers buying podcast placements, there's no reliable way to verify whether a niche show's numbers reflect real listeners.
By publishing the dataset, Listen Notes has handed the research community a labeled training set. Building a classifier that reliably distinguishes synthetic from authentic audio is a solvable problem, but it requires real-world examples of both, and this release supplies the synthetic side at scale. Whether platform operators fund the work to use it depends on how much pressure they feel from creators and advertisers. Judging by their track record on content moderation, that pressure hasn't arrived yet.