Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports.

Wasif Bala, Hanzhou Li, John Moon, Hari Trivedi, Judy Gichoya, Patricia Balthazar
{"title":"Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports.","authors":"Wasif Bala, Hanzhou Li, John Moon, Hari Trivedi, Judy Gichoya, Patricia Balthazar","doi":"10.1067/j.cpradiol.2024.08.003","DOIUrl":null,"url":null,"abstract":"<p><strong>Rationale and objectives: </strong>Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports.</p><p><strong>Materials & methods: </strong>A randomly selected subset of 500 (250 train/250 validation) paired preliminary and final reports between 12/17/2022 and 5/22/2023 were extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was used on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent in preliminary reports. For testing, a subset of 10 reports with confirmed diagnostic errors were randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.</p><p><strong>Results: </strong>The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2 %. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4 %) favored a combination of LLM-generated and traditional feedback.</p><p><strong>Conclusion: </strong>This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.</p>","PeriodicalId":93969,"journal":{"name":"Current problems in diagnostic radiology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current problems in diagnostic radiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1067/j.cpradiol.2024.08.003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Rationale and objectives: Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports.

Materials & methods: A randomly selected subset of 500 paired preliminary and final reports (250 training/250 validation) issued between 12/17/2022 and 5/22/2023 was extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was performed on a subset of the training/validation sets to direct the model to identify important findings present in the final report but absent from the preliminary report. For testing, a subset of 10 reports with confirmed diagnostic errors was randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.
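
The abstract describes feeding a tuned prompt plus the paired report text to the GPT-4 API, but it does not publish the prompt or the surrounding code. The sketch below is a minimal, hypothetical Python illustration of that step, assuming the current openai client; the SYSTEM_PROMPT wording, the find_missed_diagnoses function name, and the temperature setting are stand-ins for illustration, not the authors' implementation.

```python
# Illustrative sketch only: the study's actual prompt and pipeline are not published
# in the abstract, so the prompt text and function below are hypothetical stand-ins.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are assisting with radiology resident education. Compare a resident's "
    "preliminary report to the attending's final report and list important findings "
    "that appear in the final report but are missing from the preliminary report."
)

def find_missed_diagnoses(preliminary_report: str, final_report: str) -> str:
    """Return the model's list of findings present in the final report but absent
    from the preliminary report (the 'discrepancies' later rated by residents)."""
    response = client.chat.completions.create(
        model="gpt-4-0314",  # model version cited in the abstract
        temperature=0,       # assumption: deterministic output for consistent feedback
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"PRELIMINARY REPORT:\n{preliminary_report}\n\n"
                    f"FINAL REPORT:\n{final_report}\n\n"
                    "List each missed finding on its own line."
                ),
            },
        ],
    )
    return response.choices[0].message.content
```

In the study's workflow, outputs of this kind were refined through iterative prompt tuning on the training/validation reports and, for the 10-report test set, rated by the 14 residents.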

Results: The model identified 24 unique missed diagnoses across the 10 test reports, with i% model prediction accuracy as rated by the 14 residents. Users identified five additional diagnoses, yielding a model sensitivity of 79.2%. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and a mean perceived-accuracy rating of 3.64 out of 5 for the LLM-generated feedback. Most respondents (71.4%) favored a combination of LLM-generated and traditional feedback.
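
For reference, a standard sensitivity calculation in this setting is the fraction of true missed diagnoses that the model itself flagged; the abstract does not report the exact true-positive and false-negative counts behind the 79.2% figure, so the expression below is the conventional definition rather than the authors' stated arithmetic.

```latex
\text{sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}
```

Here TP counts discrepancies the model correctly flagged and FN counts true discrepancies it failed to flag (for example, diagnoses caught only by the resident reviewers).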

Conclusion: This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.
