Can Artificial Intelligence Deceive Residency Committees? A Randomized Multicenter Analysis of Letters of Recommendation.

Impact factor 2.6; Zone 2 (Medicine); JCR Q1 (Orthopedics)

Journal of the American Academy of Orthopaedic Surgeons; Pub Date: 2025-03-15; Epub Date: 2024-12-12; DOI: 10.5435/JAAOS-D-24-00438

Samuel K Simister, Eric G Huish, Eugene Y Tsai, Hai V Le, Andrea Halim, Dominick Tuason, John P Meehan, Holly B Leshikar, Augustine M Saiz, Zachary C Lum

Pages: e348-e355. Publication type: Journal Article.

Citations: 0

Abstract

Introduction: The introduction of generative artificial intelligence (AI) may have a profound effect on residency applications. In this study, we explore the capabilities of AI-generated letters of recommendation (LORs) by evaluating how accurately orthopaedic surgery residency selection committee members can identify whether an LOR was written by a human or an AI author.

Methods: In a multicenter, single-blind trial, a total of 45 LORs (15 human, 15 ChatGPT, and 15 Google BARD) were curated. In random order, seven faculty reviewers from four residency programs were asked to grade each of the 45 LORs on the 11 characteristics outlined in the American Orthopaedic Association's standardized LOR, to rate on a 1 to 10 scale how they would rank the applicant and how much they would want the applicant in their program, and to state whether they thought the letter was generated by a human or an AI author. Analysis included descriptive statistics, ordinal regression, and a receiver operating characteristic curve to compare accuracy based on the number of letters reviewed.

Results: Faculty reviewers correctly identified 40% (42/105) of human-generated and 63% (132/210) of AI-generated letters (P < 0.001), and their accuracy did not improve as they reviewed more letters (AUC 0.451, P = 0.102). When analyzed by perceived author, letters marked as human generated had significantly higher means for all variables (P = 0.01). BARD did markedly better than human authors in accuracy (3.25 [1.79 to 5.92], P < 0.001), adaptability (1.29 [1.02 to 1.65], P = 0.034), and perceived commitment (1.56 [0.99 to 2.47], P < 0.055). Additional analysis controlling for reviewer background showed no differences in outcomes based on experience or familiarity with the AI programs.
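
As a rough check on the figures above, the following minimal Python sketch recomputes the identification rates from the reported counts. The variable names are illustrative only, and the pooled accuracy is derived here from the reported counts rather than stated in the abstract. The design check assumes that every one of the seven reviewers graded all 45 letters, as the Methods describe.

```python
# Counts taken from the Results paragraph.
human_correct, human_total = 42, 105   # human-written letters judged "human"
ai_correct, ai_total = 132, 210        # AI-written letters judged "AI"


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


human_rate = pct(human_correct, human_total)   # share of human letters identified
ai_rate = pct(ai_correct, ai_total)            # share of AI letters identified

# Pooled accuracy across all ratings (derived, not reported in the abstract).
overall_correct = human_correct + ai_correct
overall_total = human_total + ai_total
overall_rate = pct(overall_correct, overall_total)
misclassified_rate = pct(overall_total - overall_correct, overall_total)

# Design check (assumption: each reviewer graded every letter).
reviewers, human_letters, ai_letters = 7, 15, 30
design_consistent = (
    reviewers * human_letters == human_total
    and reviewers * ai_letters == ai_total
)

if __name__ == "__main__":
    print(f"Human letters identified: {human_rate}%")
    print(f"AI letters identified:    {ai_rate}%")
    print(f"Pooled accuracy:          {overall_rate}% of {overall_total} ratings")
    print(f"Pooled misclassification: {misclassified_rate}%")
    print(f"Counts match a 7 x 45 design: {design_consistent}")
```

Under these assumptions the pooled accuracy is about 55% (174 of 315 ratings), so roughly 45% of ratings were misclassified, which is broadly in line with the "50% of the time" figure in the Conclusion.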

Conclusion: Faculty members failed to distinguish human-generated from AI-generated LORs 50% of the time, which suggests that AI can generate LORs comparable to those written by human authors. This highlights the need for selection committees to reconsider the role and influence of LORs in residency applications.

Source journal

Journal of the American Academy of Orthopaedic Surgeons (Medicine: Orthopedics)

CiteScore: 6.10

Self-citation rate: 6.20%

Articles published: 529

Review time: 4-8 weeks

About the journal: The Journal of the American Academy of Orthopaedic Surgeons was established in the fall of 1993 by the Academy in response to its membership's demand for a clinical review journal. Two issues were published the first year, followed by six issues yearly from 1994 through 2004. In September 2005, JAAOS began publishing monthly issues. Each issue includes richly illustrated peer-reviewed articles focused on clinical diagnosis and management. Special features in each issue provide commentary on developments in pharmacotherapeutics, materials and techniques, and computer applications.