Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports
Wasif Bala MD, Hanzhou Li MD, John Moon MD, Hari Trivedi MD, Judy Gichoya MD, MS, Patricia Balthazar MD, CIIP
Current Problems in Diagnostic Radiology, Volume 54, Issue 2, Pages 151-158
DOI: 10.1067/j.cpradiol.2024.08.003
Published online: 2024-08-15
URL: https://www.sciencedirect.com/science/article/pii/S036301882400149X
Citations: 0
Abstract
Rationale and objectives
Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents’ preliminary reports.
Materials & methods
A randomly selected subset of 500 paired preliminary and final reports (250 training/250 validation), issued between 12/17/2022 and 5/22/2023, was extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via its API (gpt-4-0314 model version). Iterative prompt tuning on a subset of the training/validation sets directed the model to identify important findings present in the final report but absent from the preliminary report. For testing, a subset of 10 reports with confirmed diagnostic errors was randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.
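The report-comparison step can be sketched roughly as below: one preliminary/final report pair is sent to the gpt-4-0314 model through the Chat Completions endpoint. The prompt wording and the helper names (`build_messages`, `find_missed_diagnoses`) are illustrative assumptions; the study's actual iteratively tuned prompt is not given in the abstract.

```python
# Minimal sketch of querying gpt-4-0314 to list findings that appear in
# a final report but are absent from the trainee's preliminary report.
# The prompt text is an illustrative stand-in, not the study's tuned prompt.
import json
import os
import urllib.request

SYSTEM_PROMPT = (
    "You are an attending radiologist. Compare the trainee's preliminary "
    "report with the finalized report and list each important finding "
    "that appears in the final report but is missing from the "
    "preliminary report, one per line."
)

def build_messages(preliminary: str, final: str) -> list:
    """Assemble the chat messages for one de-identified report pair."""
    user_content = (
        f"PRELIMINARY REPORT:\n{preliminary}\n\n"
        f"FINAL REPORT:\n{final}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

def find_missed_diagnoses(preliminary: str, final: str) -> str:
    """POST one report pair to the OpenAI Chat Completions API."""
    payload = {
        "model": "gpt-4-0314",
        "messages": build_messages(preliminary, final),
        "temperature": 0,  # favor reproducible output for evaluation
    }
    request = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

In practice each of the 250 training pairs would be run through `find_missed_diagnoses` and the prompt revised until the listed discrepancies matched clinically important misses.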
Results
The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2 %. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4 %) favored a combination of LLM-generated and traditional feedback.
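The sensitivity figure follows the standard definition, sensitivity = TP / (TP + FN), where diagnoses caught only by the resident reviewers count as false negatives. The breakdown below (19 true positives, 5 false negatives) is one mapping of the reported counts that reproduces 79.2%; the abstract does not state the split explicitly, so treat it as an assumption.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Sensitivity (recall): share of reference missed diagnoses the model caught."""
    return true_positives / (true_positives + false_negatives)

# Assumed breakdown consistent with the reported 79.2%: 19 misses caught
# by the model, 5 caught only by the resident reviewers.
print(f"{sensitivity(19, 5):.1%}")  # → 79.2%
```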
Conclusion
This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.
About the journal
Current Problems in Diagnostic Radiology covers important and controversial topics in radiology. Each issue presents important viewpoints from leading radiologists. High-quality reproductions of radiographs, CT scans, MR images, and sonograms clearly depict what is being described in each article. Also included are valuable updates relevant to other areas of practice, such as medicolegal issues or archiving systems. With its new multi-topic format and image-intensive style, Current Problems in Diagnostic Radiology offers an outstanding, time-saving investigation into the current topics most relevant to radiologists.