{"title":"人工智能大型语言模型(Copilot和Gemini)在医疗保健政策制定方面与人类专家的性能比较:一项混合方法横断面研究。","authors":"Mohsen Khosravi, Reyhane Izadi, Mina Aghamaleki Sarvestani, Hossein Bouzarjomehri, Milad Ahmadi Marzaleh, Ramin Ravangard","doi":"10.1177/14604582251381269","DOIUrl":null,"url":null,"abstract":"<p><p>ObjectiveThis study aimed to assess the performance of Artificial Intelligence (AI) compared to human experts in healthcare policymaking.MethodsThis was a mixed-methods cross-sectional study conducted in Iran during the years 2024-2025, comparing, and analyzing the responses of multiple AI Large Language Models (LLMs) including Bing AI Copilot and Gemini and a sample of 15 human experts-using confusion matrix analysis. This analysis provided comprehensive data on the respondents' ability to answer context-specific questions regarding healthcare policy making, evaluated through multiple parameters including sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and overall accuracy.ResultsCopilot demonstrated a sensitivity of 0.867, specificity of 0, PPV of 0.722, NPV of 0, and accuracy of 0.65. In comparison, Gemini exhibited a sensitivity of 0.733, specificity of 0.4, PPV of 0.786, NPV of 0.333, and also an accuracy of 0.65. Additionally, the human experts' responses indicated a sensitivity of 0.5808, specificity of 0.2571, PPV of 0.7189, NPV of 0.1579, and an accuracy of 0.5050.ConclusionThe AI LLMs outperformed human experts in responding to the study questionnaire. 
The findings demonstrated the considerable potential of the LLMs in enhancing healthcare policy-making, particularly by serving as complementary tools and collaborators alongside humans.</p>","PeriodicalId":55069,"journal":{"name":"Health Informatics Journal","volume":"31 3","pages":"14604582251381269"},"PeriodicalIF":2.3000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of artificial intelligence large language models (Copilot and Gemini) compared to human experts in healthcare policy making: A mixed-methods cross-sectional study.\",\"authors\":\"Mohsen Khosravi, Reyhane Izadi, Mina Aghamaleki Sarvestani, Hossein Bouzarjomehri, Milad Ahmadi Marzaleh, Ramin Ravangard\",\"doi\":\"10.1177/14604582251381269\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>ObjectiveThis study aimed to assess the performance of Artificial Intelligence (AI) compared to human experts in healthcare policymaking.MethodsThis was a mixed-methods cross-sectional study conducted in Iran during the years 2024-2025, comparing, and analyzing the responses of multiple AI Large Language Models (LLMs) including Bing AI Copilot and Gemini and a sample of 15 human experts-using confusion matrix analysis. This analysis provided comprehensive data on the respondents' ability to answer context-specific questions regarding healthcare policy making, evaluated through multiple parameters including sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and overall accuracy.ResultsCopilot demonstrated a sensitivity of 0.867, specificity of 0, PPV of 0.722, NPV of 0, and accuracy of 0.65. In comparison, Gemini exhibited a sensitivity of 0.733, specificity of 0.4, PPV of 0.786, NPV of 0.333, and also an accuracy of 0.65. 
Additionally, the human experts' responses indicated a sensitivity of 0.5808, specificity of 0.2571, PPV of 0.7189, NPV of 0.1579, and an accuracy of 0.5050.ConclusionThe AI LLMs outperformed human experts in responding to the study questionnaire. The findings demonstrated the considerable potential of the LLMs in enhancing healthcare policy-making, particularly by serving as complementary tools and collaborators alongside humans.</p>\",\"PeriodicalId\":55069,\"journal\":{\"name\":\"Health Informatics Journal\",\"volume\":\"31 3\",\"pages\":\"14604582251381269\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2025-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Health Informatics Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/14604582251381269\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/9/22 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Health Informatics Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/14604582251381269","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/9/22 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0
Abstract
Performance of artificial intelligence large language models (Copilot and Gemini) compared to human experts in healthcare policy making: A mixed-methods cross-sectional study.
Objective: This study aimed to assess the performance of Artificial Intelligence (AI) compared to human experts in healthcare policymaking.
Methods: This was a mixed-methods cross-sectional study conducted in Iran during 2024-2025, comparing and analyzing the responses of multiple AI Large Language Models (LLMs), including Bing AI Copilot and Gemini, and a sample of 15 human experts, using confusion matrix analysis. This analysis provided comprehensive data on the respondents' ability to answer context-specific questions regarding healthcare policy making, evaluated through multiple parameters including sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and overall accuracy.
Results: Copilot demonstrated a sensitivity of 0.867, specificity of 0, PPV of 0.722, NPV of 0, and accuracy of 0.65. In comparison, Gemini exhibited a sensitivity of 0.733, specificity of 0.4, PPV of 0.786, NPV of 0.333, and also an accuracy of 0.65. Additionally, the human experts' responses indicated a sensitivity of 0.5808, specificity of 0.2571, PPV of 0.7189, NPV of 0.1579, and an accuracy of 0.5050.
Conclusion: The AI LLMs outperformed human experts in responding to the study questionnaire. The findings demonstrated the considerable potential of LLMs in enhancing healthcare policy-making, particularly by serving as complementary tools and collaborators alongside humans.
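All of the reported metrics derive from a standard confusion matrix. As an illustration only (the paper does not report raw counts), assuming a 20-item questionnaire, the hypothetical counts TP=13, FP=5, FN=2, TN=0 are consistent with Copilot's reported figures; a minimal sketch of the metric calculations:

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard confusion-matrix metrics; undefined ratios return 0.0."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
        "ppv": tp / (tp + fp) if (tp + fp) else 0.0,          # positive predictive value
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,          # negative predictive value
        "accuracy": (tp + tn) / total,
    }

# Hypothetical counts (assumed, not from the paper) matching Copilot's results:
m = confusion_metrics(tp=13, fp=5, fn=2, tn=0)
```

Note that a specificity and NPV of 0 (as reported for Copilot) imply zero true negatives, i.e., every negative-keyed item was answered as if positive.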
About the journal:
Health Informatics Journal is an international peer-reviewed journal. All papers submitted to Health Informatics Journal are subject to peer review by members of a carefully appointed editorial board. The journal operates a conventional single-blind reviewing policy in which the reviewer’s name is always concealed from the submitting author.