The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis.

Impact factor 5.8 | CAS Tier 2 (Medicine) | JCR Q1 (Health Care Sciences & Services)
William J Waldock, Joe Zhang, Ahmad Guni, Ahmad Nabeel, Ara Darzi, Hutan Ashrafian
{"title":"The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis.","authors":"William J Waldock, Joe Zhang, Ahmad Guni, Ahmad Nabeel, Ara Darzi, Hutan Ashrafian","doi":"10.2196/56532","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations.</p><p><strong>Objective: </strong>We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards.</p><p><strong>Methods: </strong>We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers up until September 10, 2023, with all LLMs published in English journals that report clear LLM accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, there was no LLM, there was no evaluation of comparable success accuracy, and the literature was not original research.The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: \"artificial intelligence,\" \"ChatGPT,\" \"GPT,\" \"LLM,\" \"large language model,\" \"machine learning,\" \"neural network,\" \"Generative Pre-trained Transformer,\" \"Generative Transformer,\" \"Generative Language Model,\" \"Generative Model,\" \"medical exam,\" \"healthcare exam,\" and \"clinical exam.\" Sensitivity, accuracy, and precision data were extracted, including relevant CIs.</p><p><strong>Results: </strong>The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened for titles and abstracts, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67).</p><p><strong>Conclusions: </strong>LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). 
This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations.</p><p><strong>Trial registration: </strong>OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"26 ","pages":"e56532"},"PeriodicalIF":5.8000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576595/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/56532","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations.

Objective: We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards.

Methods: We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search covered all papers published up to September 10, 2023, on any LLM reported in English-language journals with clearly stated accuracy standards. The exclusion criteria were as follows: the assessment was not a health care examination, no LLM was evaluated, no comparable accuracy outcome was reported, or the publication was not original research. The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs.
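The abstract does not report the exact query syntax, so the following is an illustrative sketch only of one way such a Boolean search string could be assembled from the listed terms, pairing the AI-related terms with the examination-related terms.

```python
# Illustrative sketch only: assemble a Boolean search string from the MeSH terms
# listed above. The authors' actual database query syntax is not reported here.
ai_terms = [
    "artificial intelligence", "ChatGPT", "GPT", "LLM", "large language model",
    "machine learning", "neural network", "Generative Pre-trained Transformer",
    "Generative Transformer", "Generative Language Model", "Generative Model",
]
exam_terms = ["medical exam", "healthcare exam", "clinical exam"]

# (ai term 1 OR ai term 2 OR ...) AND (exam term 1 OR exam term 2 OR ...)
query = (
    "(" + " OR ".join(f'"{t}"' for t in ai_terms) + ")"
    + " AND "
    + "(" + " OR ".join(f'"{t}"' for t in exam_terms) + ")"
)
print(query)
```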

Results: The search identified 1673 relevant citations. After removal of duplicates, 1268 (75.8%) papers were screened by title and abstract, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs can achieve an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can achieve an overall medical examination accuracy of 0.64 (CI 0.60-0.67).
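To make the pooled accuracy figures concrete: one common way to pool per-study proportions is inverse-variance weighting. The sketch below is illustrative only, uses hypothetical study counts, and does not reproduce the specific meta-analytic model used in this review.

```python
# Minimal sketch of inverse-variance pooling of per-study accuracies, assuming
# each study reports a proportion correct (k/n). Inputs are hypothetical.
import math

# Hypothetical (correct answers, total questions) pairs for illustration only.
studies = [(55, 100), (120, 180), (70, 125)]

weights, estimates = [], []
for k, n in studies:
    p = k / n
    var = p * (1 - p) / n          # binomial variance of the proportion
    weights.append(1 / var)        # inverse-variance weight
    estimates.append(p)

pooled = sum(w * p for w, p in zip(weights, estimates)) / sum(weights)
se = math.sqrt(1 / sum(weights))
ci_low, ci_high = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled accuracy {pooled:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```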

Conclusions: LLMs show promise for remediating health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. To guide policy and deployment decisions about LLMs in health care, we propose a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations.

Trial registration: OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.

Source journal: Journal of Medical Internet Research
CiteScore: 14.40
Self-citation rate: 5.40%
Articles published: 654
Time to review: 1 month
Journal description: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the fields of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades. The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a leading publication in these disciplines, ranking in the first quartile (Q1) by impact factor, and is ranked #1 on Google Scholar within the "Medical Informatics" discipline.