Enhancing radiology training with GPT-4: Pilot analysis of automated feedback in trainee preliminary reports
Wasif Bala, Hanzhou Li, John Moon, Hari Trivedi, Judy Gichoya, Patricia Balthazar
Current Problems in Diagnostic Radiology, published online 2024-08-15. DOI: 10.1067/j.cpradiol.2024.08.003
Abstract
Rationale and objectives: Radiology residents often receive limited feedback on preliminary reports issued during independent call. This study aimed to determine if Large Language Models (LLMs) can supplement traditional feedback by identifying missed diagnoses in radiology residents' preliminary reports.
Materials & methods: A randomly selected subset of 500 paired preliminary and final reports (250 training/250 validation) issued between 12/17/2022 and 5/22/2023 was extracted and de-identified from our institutional database. The prompts and report text were input into the GPT-4 language model via the GPT-4 API (gpt-4-0314 model version). Iterative prompt tuning was performed on a subset of the training/validation sets to direct the model to identify important findings in the final report that were absent from the preliminary report. For testing, a subset of 10 reports with confirmed diagnostic errors was randomly selected. Fourteen residents with on-call experience assessed the LLM-generated discrepancies and completed a survey on their experience using a 5-point Likert scale.
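The abstract does not reproduce the authors' prompt or client code; the following is a minimal, hypothetical sketch of how a de-identified preliminary/final report pair could be sent to the gpt-4-0314 model through the OpenAI Python client to elicit discrepancy feedback. The prompt wording, function name, and temperature setting are illustrative assumptions, not the study's actual implementation.

```python
# Hypothetical sketch (not the authors' code): ask gpt-4-0314 to list
# important findings present in the final report but absent from the
# resident's preliminary report. Prompt text and parameters are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an attending radiologist reviewing a resident's preliminary "
    "report against the finalized report. List any clinically important "
    "findings stated in the final report that are missing from the "
    "preliminary report. If none are missing, reply 'No discrepancies.'"
)

def find_missed_diagnoses(preliminary_report: str, final_report: str) -> str:
    """Return LLM-generated discrepancy feedback for one report pair."""
    response = client.chat.completions.create(
        model="gpt-4-0314",
        temperature=0,  # keep output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"PRELIMINARY REPORT:\n{preliminary_report}\n\n"
                    f"FINAL REPORT:\n{final_report}"
                ),
            },
        ],
    )
    return response.choices[0].message.content

# Example usage with de-identified report text (placeholders):
# feedback = find_missed_diagnoses(prelim_text, final_text)
# print(feedback)
```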
Results: The model identified 24 unique missed diagnoses across 10 test reports with i% model prediction accuracy as rated by 14 residents. Five additional diagnoses were identified by users, resulting in a model sensitivity of 79.2 %. Post-evaluation surveys showed a mean satisfaction rating of 3.50 and perceived accuracy rating of 3.64 out of 5 for LLM-generated feedback. Most respondents (71.4 %) favored a combination of LLM-generated and traditional feedback.
Conclusion: This pilot study on the use of LLM-generated feedback for radiology resident preliminary reports demonstrated notable accuracy in identifying missed diagnoses and was positively received, highlighting LLMs' potential role in supplementing conventional feedback methods.