Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan.

IF 2.1, Zone 4, Medicine
Japanese Journal of Radiology Pub Date : 2025-09-01 Epub Date: 2025-05-14 DOI:10.1007/s11604-025-01799-1
Hirotaka Takita, Shannon L Walston, Yasuhito Mitsuyama, Ko Watanabe, Shoya Ishimaru, Daiju Ueda
Pages: 1445-1455. Publication type: Journal Article. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12396994/pdf/
Citation count: 0

Abstract

Purpose: To compare the diagnostic performance of three proprietary large language models (LLMs), namely Claude, GPT, and Gemini, in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to assess the impact of three different prompting approaches on model accuracy.

Materials and methods: In this retrospective study, head CT reports from the Japan Medical Imaging Database between 2018 and 2023 were collected. Two board-certified radiologists established the ground truth regarding intracranial hemorrhage and skull fractures through independent review and consensus. Each radiology report was analyzed by three LLMs using three prompting strategies: Standard, Chain of Thought, and Self Consistency prompting. Diagnostic performance (accuracy, precision, recall, and F1-score) was calculated for each LLM-prompt combination and compared using McNemar's tests with Bonferroni correction. Misclassified cases underwent qualitative error analysis.
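
The evaluation steps named above can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not state which software was used, how many samples Self Consistency drew, or whether the exact or chi-square form of McNemar's test was applied. The function names (`majority_vote`, `binary_metrics`, `mcnemar_exact`, `bonferroni`) are hypothetical, and the exact binomial form of McNemar's test is an assumption made here for simplicity.

```python
from collections import Counter
from math import comb


def majority_vote(labels):
    """Self Consistency style aggregation (assumed form): return the most
    frequent label among several sampled answers for one report."""
    return Counter(labels).most_common(1)[0][0]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = finding present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value comparing two classifiers on the same cases.

    Only discordant cases matter: b counts cases where A is correct and B is
    wrong, c counts cases where A is wrong and B is correct.
    """
    b = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a == t and bb != t)
    c = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a != t and bb == t)
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

In such a setup, each LLM-prompt combination would yield one prediction list per finding (hemorrhage, fracture); `binary_metrics` would be applied to each list against the radiologists' consensus labels, and `mcnemar_exact` to each pair of combinations before adjusting the resulting p-values with `bonferroni`.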

Results: A total of 3949 head CT reports from 3949 patients (mean age 59 ± 25 years, 56.2% male) were included. Across all institutions, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM-prompt combinations achieved very high accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage than GPT and Gemini, and also outperformed Gemini for skull fractures (p < 0.0001). Gemini's performance improved notably with Chain of Thought prompting. Error analysis revealed common challenges, including ambiguous phrases and findings unrelated to intracranial hemorrhage or skull fractures, underscoring the importance of careful prompt design.

Conclusion: All three proprietary LLMs exhibited strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. While the choice of prompting method influenced accuracy, all models demonstrated robust potential for clinical and research applications. Future work should refine the prompts and validate these approaches in prospective, multilingual settings.


Source journal
Japanese Journal of Radiology
Subject area: Medicine - Radiology, Nuclear Medicine and Imaging
Self-citation rate: 4.80%
Publication volume: 133
About the journal: Japanese Journal of Radiology is a peer-reviewed journal, officially published by the Japan Radiological Society. The main purpose of the journal is to provide a forum for the publication of papers documenting recent advances and new developments in the field of radiology in medicine and biology. The scope of Japanese Journal of Radiology encompasses but is not restricted to diagnostic radiology, interventional radiology, radiation oncology, nuclear medicine, radiation physics, and radiation biology. Additionally, the journal covers technical and industrial innovations. The journal welcomes original articles, technical notes, review articles, pictorial essays and letters to the editor. The journal also provides announcements from the boards and the committees of the society. Membership in the Japan Radiological Society is not a prerequisite for submission. Contributions are welcomed from all parts of the world.