Differentiating between GPT-generated and human-written feedback for radiology residents.

Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin Ym Kwan
{"title":"区分gpt生成的和人为写的放射住院医师反馈。","authors":"Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin Ym Kwan","doi":"10.1067/j.cpradiol.2025.02.002","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.</p><p><strong>Methods: </strong>Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.</p><p><strong>Results: </strong>Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (k=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).</p><p><strong>Conclusion: </strong>Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.</p>","PeriodicalId":93969,"journal":{"name":"Current problems in diagnostic radiology","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Differentiating between GPT-generated and human-written feedback for radiology residents.\",\"authors\":\"Zier Zhou, Arsalan Rizwan, Nick Rogoza, Andrew D Chung, Benjamin Ym Kwan\",\"doi\":\"10.1067/j.cpradiol.2025.02.002\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Recent competency-based medical education (CBME) implementation within Canadian radiology programs has required faculty to conduct more assessments. The rise of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about the potential of these models to generate informative comments matching human experts and associated challenges. This study compares human-written feedback to GPT-3.5-generated feedback for radiology residents, and how well raters can differentiate between these sources.</p><p><strong>Methods: </strong>Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). 
Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. 11 of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.</p><p><strong>Results: </strong>Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Low agreement (k=-0.237) between responses provided by GPT-3.5 and humans was observed. Human raters were also more accurate (80.5 %) at identifying actual and synthetic comments than GPT-3.5 (50 %).</p><p><strong>Conclusion: </strong>Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.</p>\",\"PeriodicalId\":93969,\"journal\":{\"name\":\"Current problems in diagnostic radiology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-02-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Current problems in diagnostic radiology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1067/j.cpradiol.2025.02.002\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Current problems in diagnostic radiology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1067/j.cpradiol.2025.02.002","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: The recent implementation of competency-based medical education (CBME) within Canadian radiology programs has required faculty to conduct more assessments. The growing role of narrative feedback in CBME, coinciding with the rise of large language models (LLMs), raises questions about whether these models can generate informative comments comparable to those of human experts, and about the challenges involved. This study compares human-written and GPT-3.5-generated feedback for radiology residents and examines how well raters can differentiate between these two sources.

Methods: Assessments were completed by 28 faculty members for 10 residents within a Canadian Diagnostic Radiology program (2019-2023). Comments were extracted from Elentra, de-identified, and parsed into sentences, of which 110 were randomly selected for analysis. Eleven of these comments were entered into GPT-3.5, generating 110 synthetic comments that were mixed with the actual comments. Two faculty raters and GPT-3.5 read each comment to predict whether it was human-written or GPT-generated.
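As a rough illustration of the generation step described above, the sketch below shows how de-identified seed comments could be fed to GPT-3.5 through the OpenAI chat completions API to produce synthetic feedback sentences. The prompt wording, the example seed comments, and the ten-generations-per-seed loop are assumptions for illustration, not the study's actual protocol.

```python
# Hypothetical sketch of the synthetic-comment generation step (not the study's code).
# Assumes the OpenAI Python client (v1.x) and the gpt-3.5-turbo model; the seed
# comments and prompt below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_comments = [
    "Demonstrated a systematic approach to interpreting chest radiographs.",
    "Communicated findings clearly to the referring team.",
    # ...in the study, 11 de-identified seed comments were used
]

synthetic_comments = []
for seed in seed_comments:
    for _ in range(10):  # e.g., 10 generations per seed to reach ~110 synthetic comments
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "You write one-sentence narrative feedback for radiology residents."},
                {"role": "user",
                 "content": f"Write a new feedback sentence similar in style to: {seed}"},
            ],
        )
        synthetic_comments.append(response.choices[0].message.content.strip())
```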

Results: Actual comments from humans were often longer and more specific than synthetic comments, especially when describing clinical procedures and patient interactions. Source differentiation was more difficult when both feedback types were similarly vague. Agreement between the responses provided by GPT-3.5 and those of the human raters was low (κ = -0.237). Human raters were also more accurate (80.5%) than GPT-3.5 (50%) at identifying actual and synthetic comments.
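A minimal sketch of how accuracy and agreement figures like those above could be computed, assuming scikit-learn and hypothetical label arrays (1 = human-written, 0 = GPT-generated); it is not the authors' analysis code.

```python
# Minimal sketch: rater accuracy and Cohen's kappa with scikit-learn.
# The label arrays are hypothetical placeholders, not the study's data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

true_source = [1, 1, 0, 0, 1, 0, 1, 0]   # actual origin of each comment (1 = human)
human_rater = [1, 1, 0, 1, 1, 0, 1, 0]   # human raters' predictions
gpt_rater   = [0, 1, 1, 0, 0, 1, 1, 0]   # GPT-3.5's predictions

# Accuracy of each rater against the true source
print("Human rater accuracy:", accuracy_score(true_source, human_rater))
print("GPT-3.5 accuracy:    ", accuracy_score(true_source, gpt_rater))

# Cohen's kappa between the human raters and GPT-3.5
# (values near or below zero indicate agreement no better than chance)
print("Human vs GPT-3.5 kappa:", cohen_kappa_score(human_rater, gpt_rater))
```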

Conclusion: Currently, GPT-3.5 cannot match human experts in delivering specific, nuanced feedback for radiology residents. Compared to humans, GPT-3.5 also performs worse in distinguishing between actual and synthetic comments. These insights could guide the development of more sophisticated algorithms to produce higher-quality feedback, supporting faculty development.
