The use of generative artificial intelligence-based dictation in a neurosurgical practice: a pilot study.

Impact factor 3.0, CAS Region 2 (Medicine), JCR Q2 (Clinical Neurology)

Neurosurgical focus Pub Date : 2025-07-01 DOI:10.3171/2025.4.FOCUS24834

Benjamin S Hopkins, Jonathan Dallas, James Yu, Robert G Briggs, Lawrance K Chung, David J Cote, David Gomez, Ishan Shah, John D Carmichael, John C Liu, William J Mack, Gabriel Zada



Abstract

Objective: Document dictation remains a significant clinical burden, and generative artificial intelligence (AI) systems built on transformer-based technology offer efficient speech processing methods that could streamline clinical documentation. This study aimed to evaluate the potential of generative AI to improve dictation efficiency and workflow within a targeted neurosurgical practice.

Methods: Ten operative reports from both cranial and spinal neurosurgical procedures were dictated and recorded by three independent physicians. The audio files were processed by 1) a modified speech-to-text model implemented based on a backbone architecture created by OpenAI's Whisper model and 2) Nuance's Dragon Medical One as a comparative commercial standard. Word error rate (WER) was manually reviewed.
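The abstract does not state how WER was tallied beyond manual review. As a rough illustration only, the sketch below assumes the conventional definition, WER = (substitutions + deletions + insertions) / number of reference words, computed with a word-level edit distance. The function name and the sample sentences are hypothetical and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count.

    Assumes the conventional WER definition; the study's manual
    review may have counted errors differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (before the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word in an eight-word reference.
reference_text = "the dura was opened in a cruciate fashion"
transcribed_text = "the dura was opened in cruciate fashion"
wer = word_error_rate(reference_text, transcribed_text)
```

In this hypothetical example one word out of eight is dropped, so the result is 0.125, or 12.5%. The error categories reported in the study, such as linguistic versus other errors, would need a separate manual classification step that this sketch does not attempt.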

Results: The mean WER was 1.75% for Whisper and 1.54% for Dragon (p = 0.080). When linguistic errors were excluded, Whisper outperformed Dragon, with a mean WER of 0.50% versus 1.34% (p < 0.001), and also in the mean number of total errors (Whisper: 6.1, Dragon: 9.7; p = 0.002). For all unstratified dictations, total errors correlated positively with word count (p < 0.001, R² = 0.37) and with recording length (p < 0.001, R² = 0.22). A positive correlation between words spoken per second and total errors was noted for Dragon (p = 0.020, R² = 0.18), but not for Whisper (p = 0.205, R² = 0.06). Similarly, when only linguistic errors were analyzed, this trend held for Dragon (p = 0.014, R² = 0.20), but not for Whisper (p = 0.331, R² = 0.03).

Conclusions: An AI-based model performed at a noninferior rate compared to a commercially available speech-to-text dictation program. Generative models provide potential benefits, such as contextual inference, that show promise in limiting errors at increased dictation speeds and in adjusting for impure input data.
