Clinical AI scribes - tools that listen to doctor-patient conversations and write up the notes afterward - have been adopted across North American health systems as a way to cut the hours physicians spend on documentation each day. An Ontario government audit has now put specific failures on record: the tool deployed for use by provincial doctors was hallucinating, producing details in clinical notes that were never actually spoken during appointments, CBC News reported.
The auditor's findings go beyond transcription errors, which are expected at some rate from any automated system. Hallucination in AI means the model generates plausible-sounding text that does not exist in the source material - not a misheard word, but an invented one. In clinical notes, that distinction matters: a misheard word can be checked against what was actually said, while an invented detail has no source to check it against. A transcription error might render "40mg" as "400mg." A hallucination might add a symptom the patient never mentioned.
Why clinical AI scribes fail differently than other transcription tools
Most transcription tools - including general-purpose options like AssemblyAI - convert spoken audio to text and measure accuracy based on how closely the output matches what was said. Clinical AI scribes do something harder: they interpret, organize, and summarize a conversation into structured medical notes in a specific format. A doctor doesn't want a verbatim transcript of a 10-minute appointment; they want a formatted SOAP note (Subjective, Objective, Assessment, Plan) that captures the clinically relevant information.
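The "how closely the output matches what was said" measure is commonly reported as word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that calculation, not any vendor's scoring code; the function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped from the transcript
                curr[j - 1] + 1,     # extra word inserted in the transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six.
spoken = "take 40mg once daily with food"
transcribed = "take 400mg once daily with food"
wer = word_error_rate(spoken, transcribed)
print(f"WER: {wer:.2f}")
```

A score like this only compares a transcript against the words that were spoken. It says nothing about a summarized note, which is not supposed to match the conversation word for word, and that is part of why scribes need a different kind of evaluation.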
That summarization step is where hallucinations enter. The AI model has to decide what to include, how to frame it, and how to fill the structural requirements of a clinical note. When it's uncertain, it generates - and in a medical record, a generated detail that sounds plausible can go unchecked for years.
The deployment problem behind the finding
The appeal of these tools is real. Physicians routinely spend one to three hours per day on documentation. AI scribes that reduce that burden have shown measurable gains in physician availability and reductions in burnout, and the pressure to adopt them isn't irrational.
But the Ontario finding points to a recurring pattern in healthcare AI deployment: tools get adopted based on general capability benchmarks, then used in settings where the specific error profile hasn't been adequately measured. A tool achieving 97% accuracy on general speech sounds solid until you do the arithmetic. Even if the remaining 3% were counted per appointment, 30 patient appointments per day would mean roughly one flawed note per day entering the medical records system. Speech accuracy is usually measured per word, so the number of individual errors would be far higher - and neither figure accounts for hallucination, which operates differently from straightforward transcription mistakes.
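How large the daily error count is depends on what the 3% is a percentage of. The short calculation below works through both readings. The 30 appointments and the 3% come from the text above; the 1,000 words per appointment is an assumed round number for illustration, not a measured figure.

```python
def expected_errors_per_day(error_rate: float, units_per_day: int) -> float:
    """Expected number of errors when each unit fails independently at error_rate."""
    return error_rate * units_per_day


ERROR_RATE = 0.03            # the 3% left over from "97% accuracy"
APPOINTMENTS_PER_DAY = 30    # figure used in the article
WORDS_PER_APPOINTMENT = 1000 # ASSUMPTION: illustrative round number

# Reading 1: the 3% is a per-appointment failure rate.
per_appointment = expected_errors_per_day(ERROR_RATE, APPOINTMENTS_PER_DAY)

# Reading 2: the 3% is a per-word error rate, as speech accuracy is usually reported.
per_word = expected_errors_per_day(
    ERROR_RATE, APPOINTMENTS_PER_DAY * WORDS_PER_APPOINTMENT
)

print(f"Per-appointment reading: about {per_appointment:.1f} flawed notes per day")
print(f"Per-word reading: about {per_word:.0f} word errors per day")
```

Under the per-word reading the count grows with how much is said in each visit, so the single headline accuracy figure understates how much a reviewing physician would need to catch.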
Any healthcare practice should be asking a set of questions before deploying an AI scribe: What is the false generation rate specifically for clinical terminology? How does the tool handle ambiguous audio or cross-talk between speakers? When the model is uncertain, does it flag that uncertainty or fill the gap silently? And does the actual workflow include a physician review step that catches AI errors, or has the tool been deployed in a way that bypasses that check in practice?
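To make "flag the uncertainty rather than fill the gap silently" concrete, here is a deliberately naive sketch of a grounding check: it flags any sentence in a generated note that contains a term never spoken in the transcript, so a reviewer knows where to look. This is an illustration only, not how the Ontario tool or any commercial scribe works; the function names, the stopword list, and the sample text are all hypothetical.

```python
import re

# ASSUMPTION: a tiny hand-picked list of note boilerplate words to ignore.
STOPWORDS = {"patient", "reports", "denies", "and", "the", "a", "of", "for", "with"}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_terms(note_sentence: str, transcript: str) -> list[str]:
    """Terms in the note sentence that never appear in the transcript."""
    spoken = tokens(transcript)
    candidates = tokens(note_sentence) - STOPWORDS
    return sorted(term for term in candidates if term not in spoken)


def flag_for_review(note: str, transcript: str) -> list[tuple[str, list[str]]]:
    """Return (sentence, unsupported terms) pairs that a physician should check."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        missing = unsupported_terms(sentence, transcript)
        if missing:
            flagged.append((sentence, missing))
    return flagged


# Hypothetical example: the note adds a symptom the patient never mentioned.
transcript = "I've had a headache for three days. No fever. I took ibuprofen yesterday."
note = (
    "Patient reports headache for three days. "
    "Patient denies fever. "
    "Patient reports nausea and took ibuprofen."
)

for sentence, missing in flag_for_review(note, transcript):
    print(f"REVIEW: {sentence!r} (not found in transcript: {missing})")
```

A word-matching check like this would miss paraphrases and would not notice a flipped negation, so it is no substitute for physician review. The design point is narrower: surfacing unsupported content for a human is a different behavior from silently writing it into the record.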
The Ontario auditor's finding won't stop AI adoption in clinical settings - the documentation burden is too real for that. But it should raise the bar for what "good enough" means before an AI writes anything that enters a patient's permanent medical record.