Michael Scheschenja, Moritz B. Bastian, Joel Wessendorf, Andreas D. Owczarek, Alexander M. König, Simon Viniol, Andreas H. Mahnken
Title: ChatGPT: Evaluating answers on contrast media related questions and finetuning by providing the model with the ESUR guideline on contrast agents
Journal: Current Problems in Diagnostic Radiology, Volume 53, Issue 4, Pages 488-493
Publication date: 2024-04-21
DOI: 10.1067/j.cpradiol.2024.04.005
Impact factor: 1.5 (JCR Q3, Radiology, Nuclear Medicine & Medical Imaging)
URL: https://www.sciencedirect.com/science/article/pii/S0363018824000756
Citations: 0
Abstract
Objective
This study aimed to assess the feasibility of GPT-4 for answering questions related to contrast media with and without the context of the European Society of Urogenital Radiology (ESUR) guideline on contrast agents. The overarching goal was to determine whether contextual enrichment by providing guideline information improves answers of GPT-4 for clinical decision-making in radiology.
Methods
A set of 64 questions, based on the ESUR guideline on contrast agents and mirroring its pertinent sections, was developed and posed to GPT-4 both directly and after the guideline had been provided through a plugin. Experienced radiologists graded the responses for quality of information and for accuracy in pinpointing information from the guideline, and radiology residents graded them for utility, all using Likert scales.
Results
GPT-4's performance improved significantly with the guideline. Without the guideline, the average quality rating was 3.98, which increased to 4.33 with the guideline (p = 0.036). In terms of accuracy, 82.3% of answers matched the information from the guideline. Utility scores also improved significantly with the guideline, with average scores of 4.1 (without) and 4.4 (with) (p = 0.008) and a Fleiss' kappa of 0.44.
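For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how a Fleiss' kappa can be computed from Likert ratings. The abstract does not describe the authors' actual computation or software, so the function name `fleiss_kappa` and the example ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of per-item lists, e.g. [[4, 4, 5], [5, 5, 5]]
    categories: all possible rating values, e.g. range(1, 6) for a 5-point Likert scale
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # How many raters chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    # Undefined (division by zero) if every rating falls in a single category.
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical utility ratings: 4 answers, 3 raters each, 5-point Likert scale.
example_ratings = [[4, 4, 5], [5, 5, 5], [3, 4, 4], [4, 5, 5]]
kappa = fleiss_kappa(example_ratings, range(1, 6))
print(f"Fleiss' kappa: {kappa:.3f}")
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; on the commonly used Landis and Koch scale, the reported 0.44 falls in the "moderate" band.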
Conclusion
GPT-4, when contextually enriched with a guideline, demonstrates enhanced capability in providing guideline-backed recommendations. This approach holds promise for real-time clinical decision-support, making guidelines more actionable. However, further refinements are necessary to maximize the potential of large language models (LLMs). Inherent limitations need to be addressed.
About the journal:
Current Problems in Diagnostic Radiology covers important and controversial topics in radiology. Each issue presents important viewpoints from leading radiologists. High-quality reproductions of radiographs, CT scans, MR images, and sonograms clearly depict what is being described in each article. Also included are valuable updates relevant to other areas of practice, such as medical-legal issues or archiving systems. With new multi-topic format and image-intensive style, Current Problems in Diagnostic Radiology offers an outstanding, time-saving investigation into current topics most relevant to radiologists.