Comparing Five Generative AI Chatbots' Answers to LLM-Generated Clinical Questions with Medical Information Scientists' Evidence Summaries

Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Jing Su, Dario A Giuse, Poppy A Krump, Nunzia B Giuse

medRxiv: the preprint server for health sciences. Published September 27, 2025. DOI: 10.1101/2025.09.24.25336199. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12486027/pdf/
Abstract
Objective: To compare answers to clinical questions provided by five publicly available large language model (LLM) chatbots with those provided by medical information scientists.
Methods: LLMs were prompted to provide 45 PICO (patient, intervention, comparison, outcome) questions addressing treatment, prognosis, and etiology. Each question was answered by a medical information scientist and submitted to five LLM tools: ChatGPT, Gemini, Copilot, DeepSeek, and Grok-3. Using key elements from the answers provided, pairs of information scientists labeled each LLM answer as in Total Alignment, Partial Alignment, or No Alignment with the information scientist's answer. The Partial Alignment answers were also analyzed for the inclusion of additional information.
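The abstract does not specify how the ratings were recorded. As a purely hypothetical sketch of the rating scheme it describes (three alignment categories, one consensus label per chatbot answer, plus a count of additional information elements), one might model a record like this in Python; all names and fields here are assumed, not taken from the paper:

```python
# Hypothetical schema for the rating scheme described above; the paper
# does not specify a data format, so all names and fields are assumed.
from dataclasses import dataclass
from enum import Enum


class Alignment(Enum):
    TOTAL = "Total Alignment"      # all key elements matched
    PARTIAL = "Partial Alignment"  # some key elements matched
    NONE = "No Alignment"          # no key elements matched


@dataclass
class RatedAnswer:
    question_id: int      # one of the 45 PICO questions
    chatbot: str          # "ChatGPT", "Gemini", "Copilot", "DeepSeek", or "Grok-3"
    alignment: Alignment  # consensus label from a pair of information scientists
    extra_elements: int   # additional information beyond the key elements


# Example record:
example = RatedAnswer(question_id=1, chatbot="ChatGPT",
                      alignment=Alignment.PARTIAL, extra_elements=2)
print(example.alignment.value)  # -> "Partial Alignment"
```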
Results: The full set of 225 LLM answers was assessed as being in Total Alignment 20.9% of the time (n=47), in Partial Alignment 78.7% of the time (n=177), and in No Alignment 0.4% of the time (n=1). Kruskal-Wallis testing found no significant difference in alignment ratings among the five chatbots (p=0.46). An analysis of the partially aligned answers found a significant difference in the number of additional elements provided by the information scientists versus the chatbots per Wilcoxon rank-sum testing (p=0.02).
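As a rough illustration of the reported analyses, the following Python sketch recomputes the alignment percentages from the published counts and runs the two named tests (Kruskal-Wallis, Wilcoxon rank-sum) on randomly generated placeholder data. Only the aggregate counts (47, 177, and 1 of 225) come from the results; every per-answer value below is hypothetical:

```python
# Sketch of the reported analyses. Only the aggregate counts are from
# the paper; all per-answer values below are random placeholders.
import random

from scipy import stats

# Reported alignment counts across all 225 chatbot answers.
counts = {"Total Alignment": 47, "Partial Alignment": 177, "No Alignment": 1}
total = sum(counts.values())  # 225
for label, n in counts.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.1f}%")
# -> 20.9%, 78.7%, 0.4%, matching the reported percentages.

random.seed(0)

# Kruskal-Wallis test comparing ordinal alignment ratings
# (1 = No, 2 = Partial, 3 = Total) across the five chatbots,
# 45 ratings per tool. These ratings are illustrative only.
tools = ["ChatGPT", "Gemini", "Copilot", "DeepSeek", "Grok-3"]
ratings = {t: random.choices([1, 2, 3], weights=[1, 79, 20], k=45) for t in tools}
stat, p = stats.kruskal(*ratings.values())
print(f"Kruskal-Wallis p = {p:.2f}")  # the study reported p = 0.46

# Wilcoxon rank-sum test on counts of additional elements in the
# partially aligned answers: information scientists vs. chatbots.
scientist_extras = [random.randint(0, 5) for _ in range(177)]  # hypothetical
chatbot_extras = [random.randint(0, 3) for _ in range(177)]    # hypothetical
stat, p = stats.ranksums(scientist_extras, chatbot_extras)
print(f"Wilcoxon rank-sum p = {p:.3f}")  # the study reported p = 0.02
```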
Discussion: The five chatbots did not differ significantly in their alignment with the information scientists' evidence summaries. The analysis of partially aligned answers found that both the chatbots and the information scientists included additional information, with the information scientists doing so significantly more often. An important next step will be to assess the additional information from both the chatbots and the information scientists for validity and relevance.