Large Language Models for Diagnosing Focal Liver Lesions From CT/MRI Reports: A Comparative Study With Radiologists

IF 6 | Zone 2, Medicine | JCR Q1, Gastroenterology & Hepatology
Liuji Sheng, Yidi Chen, Hong Wei, Feng Che, Yingyi Wu, Qin Qin, Chongtu Yang, Yanshu Wang, Jingwen Peng, Mustafa R. Bashir, Maxime Ronot, Bin Song, Hanyu Jiang
Liver International, vol. 45, no. 6. Published 2025-05-10. Journal Article. DOI: 10.1111/liv.70115
https://onlinelibrary.wiley.com/doi/10.1111/liv.70115
Citations: 0

Abstract

Background & Aims

Whether large language models (LLMs) can be integrated into the diagnostic workflow for focal liver lesions (FLLs) remains unclear. We aimed to evaluate the diagnostic accuracy of two generic LLMs (ChatGPT-4o and Gemini) working from CT/MRI reports, both compared with and combined with radiologists of different experience levels.

Methods

From April 2022 to April 2024, this single-center retrospective study included consecutive adult patients who underwent contrast-enhanced CT/MRI for a single FLL followed by histopathologic examination. The LLMs were prompted three times with clinical information and the “findings” section of the radiology reports to provide differential diagnoses in descending order of likelihood, with the first-listed diagnosis considered the final diagnosis. In the research setting, six radiologists (three junior and three middle-level) independently reviewed the CT/MRI images and clinical information in two rounds (first alone, then with LLM assistance). In the clinical setting, diagnoses were retrieved from the “impressions” section of the radiology reports. Diagnostic accuracy was assessed against histopathology.
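The scoring rule described above (the first-listed differential counts as the final diagnosis and is checked against histopathology) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the exact-string matching after lowercasing, and the toy cases are all assumptions, and the abstract does not say how the three prompt repetitions were reconciled.

```python
from typing import Sequence


def final_diagnosis_accuracy(
    differentials: Sequence[Sequence[str]],
    histopathology: Sequence[str],
) -> float:
    """Fraction of cases whose top-ranked differential equals the reference.

    Hypothetical helper: labels are compared as case-insensitive strings,
    which is an assumption made for this sketch only.
    """
    if len(differentials) != len(histopathology):
        raise ValueError("one differential list is needed per reference label")
    if not histopathology:
        raise ValueError("at least one case is required")
    hits = sum(
        1
        for ranked, truth in zip(differentials, histopathology)
        # An empty differential list counts as a miss.
        if ranked and ranked[0].strip().lower() == truth.strip().lower()
    )
    return hits / len(histopathology)


# Made-up toy cases for illustration; these are not study data.
toy_differentials = [
    ["hepatocellular carcinoma", "cholangiocarcinoma"],
    ["hemangioma", "focal nodular hyperplasia"],
    ["metastasis", "hepatocellular carcinoma"],
    ["cholangiocarcinoma", "metastasis"],
]
toy_reference = [
    "hepatocellular carcinoma",
    "focal nodular hyperplasia",
    "metastasis",
    "cholangiocarcinoma",
]
toy_accuracy = final_diagnosis_accuracy(toy_differentials, toy_reference)
```

In the toy data the second case is a miss because the correct label appears only in second place, so three of the four cases are scored as correct.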

Results

A total of 228 patients (median age, 59 years; 155 males) with 228 FLLs (median size, 3.6 cm) were included. For the final diagnosis, the accuracy of two-step ChatGPT-4o (78.9%) was higher than that of single-step ChatGPT-4o (68.0%, p < 0.001) and single-step Gemini (73.2%, p = 0.004), similar to that of real-world radiology reports (80.0%, p = 0.34) and junior radiologists (78.9%–82.0%; p-values, 0.21 to > 0.99), but lower than that of middle-level radiologists (84.6%–85.5%; p-values, 0.001 to 0.02). No incremental diagnostic value of ChatGPT-4o was observed for any radiologist (p-values, 0.63 to > 0.99).
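Because every reader and model diagnosed the same 228 lesions, the accuracies above are paired comparisons. The abstract does not name the statistical test used; one common choice for paired binary outcomes is the exact McNemar test, sketched here as an assumption. The function name and the example counts are hypothetical and are not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases that method A got right and method B got wrong
    c: cases that method A got wrong and method B got right
    Only these discordant pairs carry information about a difference.
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair falls on either
    # side with probability 0.5, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
p_no_difference = mcnemar_exact(3, 3)
p_one_sided_split = mcnemar_exact(0, 5)
```

The test ignores cases where both methods agree, which is why two methods with similar overall accuracy can still differ significantly if their errors fall on different lesions.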

Conclusion

Two-step ChatGPT-4o matched the accuracy of real-world radiology reports and junior radiologists for diagnosing FLLs, but was less accurate than middle-level radiologists and demonstrated little incremental diagnostic value.

Source journal: Liver International (Medicine, Gastroenterology & Hepatology)
CiteScore: 13.90
Self-citation rate: 4.50%
Articles published: 348
Review time: 2 months
Journal description: Liver International promotes all aspects of the science of hepatology from basic research to applied clinical studies. Providing an international forum for the publication of high-quality original research in hepatology, it is an essential resource for everyone working on normal and abnormal structure and function in the liver and its constituent cells, including clinicians and basic scientists involved in the multi-disciplinary field of hepatology. The journal welcomes articles from all fields of hepatology, which may be published as original articles, brief definitive reports, reviews, mini-reviews, images in hepatology and letters to the Editor.