Evaluating a large language model's ability to answer clinicians' requests for evidence summaries.

Impact Factor 2.9 | CAS Tier 4 (Medicine) | JCR Q1, Information Science & Library Science
Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Dario A Giuse, Jing Su, Poppy A Krump, Nunzia Bettinsoli Giuse
{"title":"Evaluating a large language model's ability to answer clinicians' requests for evidence summaries.","authors":"Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Dario A Giuse, Jing Su, Poppy A Krump, Nunzia Bettinsoli Giuse","doi":"10.5195/jmla.2025.1985","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.</p><p><strong>Methods: </strong>Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.</p><p><strong>Results: </strong>Of the 216 evaluated questions, aiChat's response was assessed as \"correct\" for 180 (83.3%) questions, \"partially correct\" for 35 (16.2%) questions, and \"incorrect\" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.</p><p><strong>Conclusions: </strong>Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.</p>","PeriodicalId":47690,"journal":{"name":"Journal of the Medical Library Association","volume":"113 1","pages":"65-77"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835037/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Medical Library Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.5195/jmla.2025.1985","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
Citations: 0

Abstract

Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.

Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question to aiChat, an internally managed chat tool using GPT-4, and recorded the responses. Each aiChat summary was evaluated on whether it contained the critical elements present in the librarian's established gold-standard summary. A subset of questions was randomly selected for verification of the references provided by aiChat.
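The abstract does not reproduce the study's actual prompt, and aiChat is an internal tool. The sketch below illustrates, under those caveats, how a COSTAR-structured prompt (Context, Objective, Style, Tone, Audience, Response) might be assembled and submitted to a GPT-4 endpoint; the OpenAI API stands in for the internal interface, and all field wording is hypothetical.

```python
# Minimal sketch, assuming an OpenAI-compatible endpoint stands in for the
# internal aiChat tool. The COSTAR field wording below is hypothetical and
# is not the study's actual standardized prompt.
from openai import OpenAI

def build_costar_prompt(question: str) -> str:
    """Assemble a prompt following the COSTAR framework
    (Context, Objective, Style, Tone, Audience, Response)."""
    return "\n".join([
        "CONTEXT: A clinician has requested a summary of the best available evidence.",
        "OBJECTIVE: Answer the clinical question below with an evidence synthesis.",
        "STYLE: Concise structured summary with supporting references.",
        "TONE: Professional and objective.",
        "AUDIENCE: Practicing clinicians.",
        "RESPONSE: A short narrative summary followed by a numbered reference list.",
        f"QUESTION: {question}",
    ])

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_costar_prompt(
        "Does early mobilization reduce length of stay in ICU patients?")}],
)
print(reply.choices[0].message.content)
```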

Results: Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a randomly selected subset of 30% of questions (n=66), the aiChat summaries provided 162 references, of which 60 (37%) were confirmed as nonfabricated.
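The abstract reports only the overall p-value and does not name the statistical test or publish per-category counts. A common choice for comparing rating distributions across categories is a chi-square test of independence; the sketch below shows that form with an invented contingency table whose row splits are illustrative only (the column totals are chosen to match the reported 180/35/1 overall counts).

```python
# Minimal sketch of a chi-square test of independence, assuming this (or a
# similar) test underlies the reported p=0.73. The per-category counts below
# are invented for illustration; only the column totals match the abstract.
from scipy.stats import chi2_contingency

# Rows: hypothetical question categories;
# columns: correct / partially correct / incorrect.
ratings = [
    [60, 11, 0],
    [58, 13, 1],
    [62, 11, 0],
]
chi2, p, dof, _expected = chi2_contingency(ratings)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```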

Conclusions: Overall, the performance of the generative AI tool was promising. However, many of the included references could not be independently verified, and no attempt was made to assess whether any additional concepts introduced by aiChat were factually accurate. We therefore envision this study as the first in a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
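The abstract notes that many references could not be independently verified but does not describe the verification workflow. One plausible first-pass approach, sketched below under that assumption, is to check each cited title against PubMed via the NCBI E-utilities eSearch endpoint.

```python
# Minimal sketch, assuming a title-field lookup against PubMed is an
# acceptable first-pass check for fabricated references. The study's actual
# verification procedure is not described in the abstract.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def found_in_pubmed(title: str) -> bool:
    """Return True if a title-field search yields at least one PubMed hit."""
    params = {"db": "pubmed", "term": f"{title}[Title]", "retmode": "json"}
    data = requests.get(ESEARCH, params=params, timeout=10).json()
    return int(data["esearchresult"]["count"]) > 0

# Example: this article's own title should be findable.
print(found_in_pubmed(
    "Evaluating a large language model's ability to answer "
    "clinicians' requests for evidence summaries"))
```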

Source Journal

Journal of the Medical Library Association
Category: Information Science & Library Science
CiteScore: 4.10
Self-citation rate: 10.00%
Annual article output: 39
Review time: 26 weeks

Journal description: The Journal of the Medical Library Association (JMLA) is an international, peer-reviewed journal published quarterly that aims to advance the practice and research knowledgebase of health sciences librarianship.