Ask a thousand people to identify AI-generated text and most will point to the em-dashes. ChatGPT and Claude write like copy editors who read too many literary magazines - which is odd, because the common assumption is that these models learned to write from the internet.
If that were true, you'd expect outputs that sound like casual online writing: short sentences, abbreviations, lowercase everything. Instead you get polished prose with em-dashes, subordinate clauses, and phrases like "It is worth noting" that nobody has actually said out loud since 1987.
The Training Mix Was Never Just the Web
Large language models are trained on web-scraped text, but that's only a fraction of the story. The datasets that shaped most major models - filtered Common Crawl subsets, The Pile, and proprietary collections - mix web content with substantial portions of books, Wikipedia, academic papers, legal documents, and published journalism.
Books from Project Gutenberg and digitized archives feature heavily. So do arXiv preprints, legal databases, and writing drawn from established publications. When a model infers what "good writing" looks like, it's learning largely from sources a professor would assign, not from comment sections. Em-dashes appear constantly in published books and magazine journalism. The model picked up the pattern because that's what it saw associated with authoritative, polished text.
Feedback Training Locked It In
After initial training, models go through RLHF - reinforcement learning from human feedback - where human raters rate or rank thousands of outputs for quality and helpfulness. Raters generally preferred responses that sounded competent and polished. Given a choice between something that reads like a Reddit comment and something that reads like a professional writer, the professional writer won.
The model optimized toward formal, edited prose. Em-dashes stayed. Casual contractions got smoothed out. The internet's actual writing habits - fragmented, contradictory, informal - got filtered out as low-quality signal.
The result is a recognizable fingerprint: em-dashes mid-sentence, "let's unpack this," unnaturally even paragraph lengths, and a near-complete absence of the contradictions and tangents that characterize human writing. These aren't random quirks - they're the artifacts of treating formal written English as a quality proxy. Understanding this is useful both for spotting AI-generated content and for knowing what to correct if you want AI-assisted writing that actually sounds like you wrote it.
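If you want to check a draft against that fingerprint, the surface cues are easy to count. What follows is a rough sketch in Python, not a detector: the function name `style_fingerprint`, the phrase list, and the choice of standard deviation as a stand-in for "unnaturally even paragraph lengths" are all assumptions made for illustration, and a high count proves nothing about who wrote the text.

```python
import re
import statistics

# Illustrative phrase list (an assumption): the two stock phrases quoted
# in this article. Extend it with whatever tics you notice in your drafts.
STOCK_PHRASES = ("it is worth noting", "let's unpack this")


def style_fingerprint(text: str) -> dict:
    """Count the surface cues described above. A heuristic, not a classifier."""
    # Treat blank lines as paragraph breaks.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    word_counts = [len(p.split()) for p in paragraphs]

    # Normalize curly apostrophes so "let's" matches either spelling.
    lowered = text.lower().replace("\u2019", "'")

    return {
        # "\u2014" is the em-dash character.
        "em_dashes": text.count("\u2014"),
        "stock_phrases": sum(lowered.count(p) for p in STOCK_PHRASES),
        "paragraphs": len(paragraphs),
        # A value near zero means every paragraph is about the same length.
        "paragraph_length_stdev": (
            statistics.pstdev(word_counts) if word_counts else 0.0
        ),
    }
```

Run it on your own draft before and after an AI-assisted edit and compare the numbers. The shift between the two is more informative than any single score.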