Evaluating a large language model's ability to answer clinicians' requests for evidence summaries.

Impact Factor 2.9 | CAS Tier 4 (Medicine) | JCR Q1, Information Science & Library Science
Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Dario A Giuse, Jing Su, Poppy A Krump, Nunzia Bettinsoli Giuse
{"title":"Evaluating a large language model's ability to answer clinicians' requests for evidence summaries.","authors":"Mallory N Blasingame, Taneya Y Koonce, Annette M Williams, Dario A Giuse, Jing Su, Poppy A Krump, Nunzia Bettinsoli Giuse","doi":"10.5195/jmla.2025.1985","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.</p><p><strong>Methods: </strong>Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat.</p><p><strong>Results: </strong>Of the 216 evaluated questions, aiChat's response was assessed as \"correct\" for 180 (83.3%) questions, \"partially correct\" for 35 (16.2%) questions, and \"incorrect\" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated.</p><p><strong>Conclusions: </strong>Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.</p>","PeriodicalId":47690,"journal":{"name":"Journal of the Medical Library Association","volume":"113 1","pages":"65-77"},"PeriodicalIF":2.9000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835037/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Medical Library Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.5195/jmla.2025.1985","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
Citations: 0

Abstract

Objective: This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses.

Methods: Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question to aiChat, an internally managed chat tool using GPT-4, and recorded the responses. Each aiChat summary was evaluated on whether it contained the critical elements present in the librarian's established gold-standard summary. A subset of questions was randomly selected for verification of the references provided by aiChat.
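The abstract does not reproduce the study's actual prompt, and aiChat is an internal tool. The sketch below illustrates, under those caveats, how a COSTAR-structured prompt (Context, Objective, Style, Tone, Audience, Response) might be assembled and submitted to a GPT-4 endpoint; the OpenAI API stands in for the internal interface, and all field wording is hypothetical.

```python
# Minimal sketch, assuming an OpenAI-compatible endpoint stands in for the
# internal aiChat tool. The COSTAR field wording below is hypothetical and
# is not the study's actual standardized prompt.
from openai import OpenAI

def build_costar_prompt(question: str) -> str:
    """Assemble a prompt following the COSTAR framework
    (Context, Objective, Style, Tone, Audience, Response)."""
    return "\n".join([
        "CONTEXT: A clinician has requested a summary of the best available evidence.",
        "OBJECTIVE: Answer the clinical question below with an evidence synthesis.",
        "STYLE: Concise structured summary with supporting references.",
        "TONE: Professional and objective.",
        "AUDIENCE: Practicing clinicians.",
        "RESPONSE: A short narrative summary followed by a numbered reference list.",
        f"QUESTION: {question}",
    ])

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_costar_prompt(
        "Does early mobilization reduce length of stay in ICU patients?")}],
)
print(reply.choices[0].message.content)
```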

Results: Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a randomly selected subset of 30% of questions (n=66), the aiChat summaries provided 162 references, of which 60 (37%) were confirmed as nonfabricated.
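The abstract reports only the overall p-value and does not name the statistical test or publish per-category counts. A common choice for comparing rating distributions across categories is a chi-square test of independence; the sketch below shows that form with an invented contingency table whose row splits are illustrative only (the column totals are chosen to match the reported 180/35/1 overall counts).

```python
# Minimal sketch of a chi-square test of independence, assuming this (or a
# similar) test underlies the reported p=0.73. The per-category counts below
# are invented for illustration; only the column totals match the abstract.
from scipy.stats import chi2_contingency

# Rows: hypothetical question categories;
# columns: correct / partially correct / incorrect.
ratings = [
    [60, 11, 0],
    [58, 13, 1],
    [62, 11, 0],
]
chi2, p, dof, _expected = chi2_contingency(ratings)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```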

Conclusions: Overall, the performance of the generative AI tool was promising. However, many of the included references could not be independently verified, and no attempt was made to assess whether any additional concepts introduced by aiChat were factually accurate. We therefore envision this study as the first in a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
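The abstract notes that many references could not be independently verified but does not describe the verification workflow. One plausible first-pass approach, sketched below under that assumption, is to check each cited title against PubMed via the NCBI E-utilities eSearch endpoint.

```python
# Minimal sketch, assuming a title-field lookup against PubMed is an
# acceptable first-pass check for fabricated references. The study's actual
# verification procedure is not described in the abstract.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def found_in_pubmed(title: str) -> bool:
    """Return True if a title-field search yields at least one PubMed hit."""
    params = {"db": "pubmed", "term": f"{title}[Title]", "retmode": "json"}
    data = requests.get(ESEARCH, params=params, timeout=10).json()
    return int(data["esearchresult"]["count"]) > 0

# Example: this article's own title should be findable.
print(found_in_pubmed(
    "Evaluating a large language model's ability to answer "
    "clinicians' requests for evidence summaries"))
```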

Source Journal

Journal of the Medical Library Association
Category: Information Science & Library Science
CiteScore: 4.10
Self-citation rate: 10.00%
Annual article output: 39
Review time: 26 weeks

Journal description: The Journal of the Medical Library Association (JMLA) is an international, peer-reviewed journal published quarterly that aims to advance the practice and research knowledgebase of health sciences librarianship.