Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin Ym Kwan
{"title":"Differentiating between GPT-generated and human-written feedback for radiology residents.","authors":"Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin Ym Kwan","doi":"10.1067/j.cpradiol.2025.02.002","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.</p><p><strong>Methods: </strong>Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.</p><p><strong>Results: </strong>Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (k=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).</p><p><strong>Conclusion: </strong>Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.</p>","PeriodicalId":93969,"journal":{"name":"Current problems in diagnostic radiology","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current problems in diagnostic radiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1067/j.cpradiol.2025.02.002","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Purpose: Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.
Methods: Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.
Results: Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (k=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).
Conclusion: Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.