Performance of artificial intelligence large language models (Copilot and Gemini) compared to human experts in healthcare policy making: A mixed-methods cross-sectional study

IF 2.3 · Q2 (HEALTH CARE SCIENCES & SERVICES) · Medicine, CAS Tier 3
Health Informatics Journal · Pub Date: 2025-07-01 · Epub Date: 2025-09-22 · DOI: 10.1177/14604582251381269
Mohsen Khosravi, Reyhane Izadi, Mina Aghamaleki Sarvestani, Hossein Bouzarjomehri, Milad Ahmadi Marzaleh, Ramin Ravangard
Citations: 0

Abstract


Objective: This study aimed to assess the performance of Artificial Intelligence (AI) compared to human experts in healthcare policymaking.
Methods: This was a mixed-methods cross-sectional study conducted in Iran during 2024-2025, comparing and analyzing the responses of multiple AI Large Language Models (LLMs), including Bing AI Copilot and Gemini, against a sample of 15 human experts using confusion matrix analysis. The analysis provided comprehensive data on the respondents' ability to answer context-specific questions on healthcare policy making, evaluated through multiple parameters: sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and overall accuracy.
Results: Copilot demonstrated a sensitivity of 0.867, specificity of 0, PPV of 0.722, NPV of 0, and accuracy of 0.65. In comparison, Gemini exhibited a sensitivity of 0.733, specificity of 0.4, PPV of 0.786, NPV of 0.333, and also an accuracy of 0.65. Additionally, the human experts' responses indicated a sensitivity of 0.5808, specificity of 0.2571, PPV of 0.7189, NPV of 0.1579, and an accuracy of 0.5050.
Conclusion: The AI LLMs outperformed human experts in responding to the study questionnaire. The findings demonstrate the considerable potential of LLMs to enhance healthcare policy-making, particularly as complementary tools and collaborators alongside humans.
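The five metrics reported above all derive from a standard 2×2 confusion matrix. As a minimal sketch, the following Python computes them from raw counts; the counts used in the example (tp=13, fp=5, fn=2, tn=0) are a hypothetical reconstruction, not taken from the paper, chosen only because they are consistent with Copilot's reported figures under the assumption of a 20-item questionnaire with 15 positive and 5 negative reference answers.

```python
# Confusion-matrix metrics as used in the study: sensitivity, specificity,
# PPV, NPV, and overall accuracy, each derived from TP/FP/FN/TN counts.

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard confusion-matrix metrics; returns 0.0 when a
    denominator is zero (e.g. no negative cases answered correctly)."""
    def ratio(num, den):
        return num / den if den else 0.0
    return {
        "sensitivity": ratio(tp, tp + fn),           # true positive rate
        "specificity": ratio(tn, tn + fp),           # true negative rate
        "ppv": ratio(tp, tp + fp),                   # positive predictive value
        "npv": ratio(tn, tn + fn),                   # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }

# Hypothetical counts consistent with Copilot's reported figures,
# assuming a 20-item questionnaire (15 positives, 5 negatives):
m = confusion_metrics(tp=13, fp=5, fn=2, tn=0)
print({k: round(v, 3) for k, v in m.items()})
# → {'sensitivity': 0.867, 'specificity': 0.0, 'ppv': 0.722, 'npv': 0.0, 'accuracy': 0.65}
```

Note how a specificity and NPV of 0 (as reported for Copilot) simply mean no negative-class item was answered correctly (tn = 0), which accuracy alone would mask; this is why the study reports all five parameters rather than accuracy only.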

Source journal
Health Informatics Journal (HEALTH CARE SCIENCES & SERVICES - MEDICAL INFORMATICS)
CiteScore: 7.80
Self-citation rate: 6.70%
Annual publication volume: 80
Review time: 6 months
About the journal: Health Informatics Journal is an international peer-reviewed journal. All papers submitted to Health Informatics Journal are subject to peer review by members of a carefully appointed editorial board. The journal operates a conventional single-blind reviewing policy in which the reviewer's name is always concealed from the submitting author.