{"title":"Evaluating the Usability, Technical Performance, and Accuracy of Artificial Intelligence Scribes for Primary Care: Competitive Analysis.","authors":"Emily Ha, Isabelle Choon-Kon-Yune, LaShawn Murray, Siying Luan, Enid Montague, Onil Bhattacharyya, Payal Agarwal","doi":"10.2196/71434","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Primary care providers (PCPs) face significant burnout due to increasing administrative and documentation demands, contributing to job dissatisfaction and impacting care quality. Artificial intelligence (AI) scribes have emerged as potential solutions to reduce administrative burden by automating clinical documentation of patient encounters. Although AI scribes are gaining popularity in primary care, there is limited information on their usability, effectiveness, and accuracy.</p><p><strong>Objective: </strong>This study aimed to develop and apply an evaluation framework to systematically assess the usability, technical performance, and accuracy of various AI scribes used in primary care settings across Canada and the United States.</p><p><strong>Methods: </strong>We conducted a systematic comparison of a suite of AI scribes using competitive analysis methods. An evaluation framework was developed using expert usability approaches and human factors engineering principles and comprises 3 domains: usability, effectiveness and technical performance, and accuracy and quality. Audio files from 4 standardized patient encounters were used to generate transcripts and SOAP (Subjective, Objective, Assessment, and Plan)-format medical notes from each AI scribe. A verbatim transcript, detailed case notes, and physician-written medical notes for each audio file served as a benchmark for comparison against the AI-generated outputs. Applicable items were rated on a 3-point Likert scale (1=poor, 2=good, 3=excellent). Additional insights were gathered from clinical experts, vendor questionnaires, and public resources to support usability, effectiveness, and quality findings.</p><p><strong>Results: </strong>In total, 6 AI scribes were evaluated, with notable performance differences. Most AI scribes could be accessed via various platforms (n=4) and launched within common electronic medical records, though data exchange capabilities were limited. Nearly all AI scribes generated SOAP-format notes in approximately 1 minute for a 15-minute standardized encounter (n=5), though documentation time increased with encounter length and topic complexity. While all AI scribes produced good to excellent quality medical notes, none were consistently error-free. Common errors included deletion, omission, and SOAP structure errors. Factors such as extraneous conversations and multiple speakers impacted the accuracy of both the transcript and medical note, with some AI scribes producing excellent notes despite minor transcript issues and vice versa. Limitations in usability, technical performance, and accuracy suggest areas for improvement to fully realize AI scribes' potential in reducing administrative burden for PCPs.</p><p><strong>Conclusions: </strong>This study offers one of the first systematic evaluations of the usability, effectiveness, and accuracy of a suite of AI scribes currently used in primary care, providing benchmark data for further research, policy, and practice. While AI scribes show promise in reducing documentation burdens, improvements and ongoing evaluations are essential to ensure safe and effective use. 
Future studies should assess AI scribe performance in real-world settings across diverse populations to support equitable and reliable applications.</p>","PeriodicalId":36351,"journal":{"name":"JMIR Human Factors","volume":"12 ","pages":"e71434"},"PeriodicalIF":3.0000,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Human Factors","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/71434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Abstract
Background: Primary care providers (PCPs) face significant burnout due to increasing administrative and documentation demands, contributing to job dissatisfaction and impacting care quality. Artificial intelligence (AI) scribes have emerged as potential solutions to reduce administrative burden by automating clinical documentation of patient encounters. Although AI scribes are gaining popularity in primary care, there is limited information on their usability, effectiveness, and accuracy.
Objective: This study aimed to develop and apply an evaluation framework to systematically assess the usability, technical performance, and accuracy of various AI scribes used in primary care settings across Canada and the United States.
Methods: We conducted a systematic comparison of a suite of AI scribes using competitive analysis methods. An evaluation framework was developed using expert usability approaches and human factors engineering principles, comprising 3 domains: usability, effectiveness and technical performance, and accuracy and quality. Audio files from 4 standardized patient encounters were used to generate transcripts and SOAP (Subjective, Objective, Assessment, and Plan)-format medical notes from each AI scribe. A verbatim transcript, detailed case notes, and physician-written medical notes for each audio file served as benchmarks for comparison against the AI-generated outputs. Applicable items were rated on a 3-point Likert scale (1=poor, 2=good, 3=excellent). Additional insights were gathered from clinical experts, vendor questionnaires, and public resources to support usability, effectiveness, and quality findings.
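To make the rating scheme concrete, the sketch below shows one way the 3-point Likert ratings could be aggregated within each of the framework's three domains for each AI scribe. This is a minimal illustration only: the scribe names, item scores, and the use of a per-domain mean are assumptions for demonstration, not the authors' actual instrument or scoring procedure.

```python
# Illustrative sketch (hypothetical data): averaging 3-point Likert ratings
# (1=poor, 2=good, 3=excellent) within each evaluation domain per AI scribe.
from statistics import mean

ratings = {
    "Scribe A": {
        "usability": [3, 2, 3],
        "effectiveness_and_technical_performance": [2, 2, 3],
        "accuracy_and_quality": [3, 3, 2],
    },
    "Scribe B": {
        "usability": [2, 2, 2],
        "effectiveness_and_technical_performance": [3, 2, 2],
        "accuracy_and_quality": [2, 3, 3],
    },
}

for scribe, domains in ratings.items():
    # Average only the applicable item ratings within each domain.
    domain_means = {domain: round(mean(scores), 2) for domain, scores in domains.items()}
    print(scribe, domain_means)
```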
Results: In total, 6 AI scribes were evaluated, with notable performance differences. Most AI scribes could be accessed via various platforms (n=4) and launched within common electronic medical records, though data exchange capabilities were limited. Nearly all AI scribes generated SOAP-format notes in approximately 1 minute for a 15-minute standardized encounter (n=5), though documentation time increased with encounter length and topic complexity. While all AI scribes produced good to excellent quality medical notes, none were consistently error-free. Common errors included deletion, omission, and SOAP structure errors. Factors such as extraneous conversations and multiple speakers impacted the accuracy of both the transcript and medical note, with some AI scribes producing excellent notes despite minor transcript issues and vice versa. Limitations in usability, technical performance, and accuracy suggest areas for improvement to fully realize AI scribes' potential in reducing administrative burden for PCPs.
Conclusions: This study offers one of the first systematic evaluations of the usability, effectiveness, and accuracy of a suite of AI scribes currently used in primary care, providing benchmark data for further research, policy, and practice. While AI scribes show promise in reducing documentation burdens, improvements and ongoing evaluations are essential to ensure safe and effective use. Future studies should assess AI scribe performance in real-world settings across diverse populations to support equitable and reliable applications.