Practical Evaluation of ChatGPT Performance for Radiology Report Generation

IF 3.8 · CAS Tier 2 (Medicine) · Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING
Mohsen Soleimani, Navisa Seyyedi, Seyed Mohammad Ayyoubzadeh, Sharareh Rostam Niakan Kalhori, Hamidreza Keshavarz
{"title":"Practical Evaluation of ChatGPT Performance for Radiology Report Generation","authors":"Mohsen Soleimani ,&nbsp;Navisa Seyyedi ,&nbsp;Seyed Mohammad Ayyoubzadeh ,&nbsp;Sharareh Rostam Niakan Kalhori ,&nbsp;Hamidreza Keshavarz","doi":"10.1016/j.acra.2024.07.020","DOIUrl":null,"url":null,"abstract":"<div><h3>Rationale and Objectives</h3><div>The process of generating radiology reports is often time-consuming and labor-intensive, prone to incompleteness, heterogeneity, and errors. By employing natural language processing (NLP)-based techniques, this study explores the potential for enhancing the efficiency of radiology report generation through the remarkable capabilities of ChatGPT (Generative Pre-training Transformer), a prominent large language model (LLM).</div></div><div><h3>Materials and Methods</h3><div>Using a sample of 1000 records from the Medical Information Mart for Intensive Care (MIMIC) Chest X-ray Database, this investigation employed Claude.ai to extract initial radiological report keywords. ChatGPT then generated radiology reports using a consistent 3-step prompt template outline. Various lexical and sentence similarity techniques were employed to evaluate the correspondence between the AI assistant-generated reports and reference reports authored by medical professionals.</div></div><div><h3>Results</h3><div>Results showed varying performance among NLP models, with Bart (Bidirectional and Auto-Regressive Transformers) and XLM (Cross-lingual Language Model) displaying high proficiency (mean similarity scores up to 99.3%), closely mirroring physician reports. Conversely, DeBERTa (Decoding-enhanced BERT with disentangled attention) and sequence-matching models scored lower, indicating less alignment with medical language. In the Impression section, the Word-Embedding model excelled with a mean similarity of 84.4%, while others like the Jaccard index showed lower performance.</div></div><div><h3>Conclusion</h3><div>Overall, the study highlights significant variations across NLP models in their ability to generate radiology reports consistent with medical professionals' language. Pairwise comparisons and Kruskal–Wallis tests confirmed these differences, emphasizing the need for careful selection and evaluation of NLP models in radiology report generation. This research underscores the potential of ChatGPT to streamline and improve the radiology reporting process, with implications for enhancing efficiency and accuracy in clinical practice.</div></div>","PeriodicalId":50928,"journal":{"name":"Academic Radiology","volume":"31 12","pages":"Pages 4823-4832"},"PeriodicalIF":3.8000,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Academic Radiology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1076633224004549","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Citations: 0

Abstract

Rationale and Objectives

Generating radiology reports is often time-consuming and labor-intensive, and the resulting reports are prone to incompleteness, heterogeneity, and error. Using natural language processing (NLP)-based techniques, this study explores the potential of ChatGPT (Generative Pre-trained Transformer), a prominent large language model (LLM), to improve the efficiency of radiology report generation.

Materials and Methods

Using a sample of 1000 records from the Medical Information Mart for Intensive Care (MIMIC) Chest X-ray Database, this investigation employed Claude.ai to extract initial radiological report keywords. ChatGPT then generated radiology reports from those keywords using a consistent three-step prompt template. A range of lexical and sentence-similarity techniques was used to evaluate the correspondence between the AI-generated reports and reference reports authored by medical professionals.
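To make the evaluation step concrete, here is a minimal sketch of embedding-based report scoring in Python. The encoder checkpoint ("all-MiniLM-L6-v2"), the sample reports, and the use of cosine similarity are illustrative assumptions; the abstract does not specify the study's exact models or pipeline.

```python
# Hedged sketch: score a generated report against a physician reference
# using sentence embeddings. The model choice and sample texts are
# assumptions for illustration, not the paper's documented setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Heart size is normal. Lungs are clear. No pleural effusion."
generated = "Normal cardiac silhouette. Clear lung fields without effusion."

# Encode both reports and take the cosine similarity of their embeddings.
embeddings = model.encode([reference, generated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Sentence similarity: {score:.3f}")
```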

Results

Performance varied across the NLP similarity models. BART (Bidirectional and Auto-Regressive Transformers) and XLM (Cross-lingual Language Model) showed the highest agreement (mean similarity scores up to 99.3%), closely mirroring physician reports. Conversely, DeBERTa (Decoding-enhanced BERT with disentangled attention) and sequence-matching models scored lower, indicating weaker alignment with medical language. In the Impression section, the word-embedding model performed best with a mean similarity of 84.4%, while lexical measures such as the Jaccard index scored lower.
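For reference, the lexical metrics named above are simple to state. The following is a minimal sketch of the Jaccard index and a sequence-matching ratio using only the Python standard library; the sample reports are invented for illustration:

```python
import difflib

def jaccard(a: str, b: str) -> float:
    """Jaccard index over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def sequence_ratio(a: str, b: str) -> float:
    """difflib ratio: 2*M / (len(a) + len(b)), with M matched characters."""
    return difflib.SequenceMatcher(None, a, b).ratio()

ref = "No acute cardiopulmonary process."
gen = "No acute cardiopulmonary abnormality is identified."
print(f"Jaccard index:    {jaccard(ref, gen):.3f}")
print(f"Sequence matcher: {sequence_ratio(ref, gen):.3f}")
```

Because these measures reward exact word overlap, they naturally score paraphrased but clinically equivalent sentences lower than embedding-based measures, which is consistent with the gap reported above.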

Conclusion

Overall, the study highlights significant variations across NLP models in their ability to generate radiology reports consistent with medical professionals' language. Pairwise comparisons and Kruskal–Wallis tests confirmed these differences, emphasizing the need for careful selection and evaluation of NLP models in radiology report generation. This research underscores the potential of ChatGPT to streamline and improve the radiology reporting process, with implications for enhancing efficiency and accuracy in clinical practice.
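As an illustration of the reported statistics, the sketch below runs a Kruskal–Wallis test across per-metric similarity scores, followed by Bonferroni-corrected pairwise comparisons with SciPy. The score values are fabricated for the example, and the choice of Mann–Whitney U for the pairwise step is an assumption; the abstract does not name the post-hoc procedure.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

scores = {  # hypothetical per-report similarity scores for three metrics
    "bart":    [0.99, 0.98, 0.99, 0.97, 0.99],
    "w2v":     [0.84, 0.82, 0.86, 0.85, 0.83],
    "jaccard": [0.41, 0.38, 0.45, 0.40, 0.43],
}

# Omnibus test: do the metrics' score distributions differ?
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.4g}")

# Pairwise follow-up (assumed Mann-Whitney U, Bonferroni-corrected).
pairs = list(combinations(scores, 2))
for a, b in pairs:
    _, p = mannwhitneyu(scores[a], scores[b])
    print(f"{a} vs {b}: corrected p = {min(p * len(pairs), 1.0):.4g}")
```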
Source journal

Academic Radiology (Medicine - Nuclear Medicine)

CiteScore: 7.60
Self-citation rate: 10.40%
Annual articles: 432
Review time: 18 days

Journal introduction: Academic Radiology publishes original reports of clinical and laboratory investigations in diagnostic imaging, the diagnostic use of radioactive isotopes, computed tomography, positron emission tomography, magnetic resonance imaging, ultrasound, digital subtraction angiography, image-guided interventions and related techniques. It also includes brief technical reports describing original observations, techniques, and instrumental developments; state-of-the-art reports on clinical issues, new technology and other topics of current medical importance; meta-analyses; scientific studies and opinions on radiologic education; and letters to the Editor.