Feasibility of Using AI to Evaluate General Surgery Residency Application Personal Statements.

Impact Factor: 2.1
Pooja M Varman, Shadae Nicholas, Andrew Conner, Ajita S Prabhu, Judith C French, Jeremy M Lipman
{"title":"使用人工智能评估普通外科住院医师申请个人陈述的可行性。","authors":"Pooja M Varman, Shadae Nicholas, Andrew Conner, Ajita S Prabhu, Judith C French, Jeremy M Lipman","doi":"10.1016/j.jsurg.2025.103655","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>As artificial intelligence (AI) becomes increasingly integrated into graduate medical education, residency programs are exploring AI's role in application screening. Personal statements (PSs) remain a highly subjective yet influential component of the residency application. This study assesses the feasibility of using a large language model (LLM) to evaluate general surgery residency PSs compared to human-assigned scores.</p><p><strong>Design, setting, and participants: </strong>We conducted a retrospective analysis of 668 deidentified PSs submitted to our general surgery residency program during the 2023-2024 application cycle. PSs were originally scored by human assessors (HA) using an anchored 1-5 scale in two domains: leadership and pathway. Each PS was subsequently scored by GPT-3.5 (AI) using the same rubric and standardized prompts. Descriptive statistics were used to compare AI and HA scores. Inter-rater agreement was assessed using weighted kappa coefficients. Discrepant cases (score differences >2 points) were reviewed qualitatively to identify scoring themes.</p><p><strong>Results: </strong>AI and HA scoring showed low agreement: κ = 0.184 for leadership and κ = 0.120 for pathway domains. Median AI leadership scores were lower (3 [IQR 2-4]) than HA scores (4 [IQR 3-5]), while AI pathway scores were higher (4 [IQR 4-5]) than HA scores (3 [IQR 3-4]). Qualitative review revealed that AI required explicit labeling (e.g., formal leadership titles or stated adversity) to assign higher scores, whereas HA rewarded inferred qualities such as resilience, passion, and longitudinal commitment.</p><p><strong>Conclusions: </strong>AI applied rubric-based scoring consistently but interpreted narrative content differently than human reviewers. While AI may enhance consistency and scalability in early application screening, its limitations in recognizing implicit meaning suggest human judgment remains essential for evaluating nuanced or inferential content. Caution should be exercised in adopting AI tools for subjective application review.</p>","PeriodicalId":94109,"journal":{"name":"Journal of surgical education","volume":" ","pages":"103655"},"PeriodicalIF":2.1000,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Feasibility of Using AI to Evaluate General Surgery Residency Application Personal Statements.\",\"authors\":\"Pooja M Varman, Shadae Nicholas, Andrew Conner, Ajita S Prabhu, Judith C French, Jeremy M Lipman\",\"doi\":\"10.1016/j.jsurg.2025.103655\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>As artificial intelligence (AI) becomes increasingly integrated into graduate medical education, residency programs are exploring AI's role in application screening. Personal statements (PSs) remain a highly subjective yet influential component of the residency application. 
This study assesses the feasibility of using a large language model (LLM) to evaluate general surgery residency PSs compared to human-assigned scores.</p><p><strong>Design, setting, and participants: </strong>We conducted a retrospective analysis of 668 deidentified PSs submitted to our general surgery residency program during the 2023-2024 application cycle. PSs were originally scored by human assessors (HA) using an anchored 1-5 scale in two domains: leadership and pathway. Each PS was subsequently scored by GPT-3.5 (AI) using the same rubric and standardized prompts. Descriptive statistics were used to compare AI and HA scores. Inter-rater agreement was assessed using weighted kappa coefficients. Discrepant cases (score differences >2 points) were reviewed qualitatively to identify scoring themes.</p><p><strong>Results: </strong>AI and HA scoring showed low agreement: κ = 0.184 for leadership and κ = 0.120 for pathway domains. Median AI leadership scores were lower (3 [IQR 2-4]) than HA scores (4 [IQR 3-5]), while AI pathway scores were higher (4 [IQR 4-5]) than HA scores (3 [IQR 3-4]). Qualitative review revealed that AI required explicit labeling (e.g., formal leadership titles or stated adversity) to assign higher scores, whereas HA rewarded inferred qualities such as resilience, passion, and longitudinal commitment.</p><p><strong>Conclusions: </strong>AI applied rubric-based scoring consistently but interpreted narrative content differently than human reviewers. While AI may enhance consistency and scalability in early application screening, its limitations in recognizing implicit meaning suggest human judgment remains essential for evaluating nuanced or inferential content. Caution should be exercised in adopting AI tools for subjective application review.</p>\",\"PeriodicalId\":94109,\"journal\":{\"name\":\"Journal of surgical education\",\"volume\":\" \",\"pages\":\"103655\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of surgical education\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1016/j.jsurg.2025.103655\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of surgical education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.jsurg.2025.103655","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract


Objective: As artificial intelligence (AI) becomes increasingly integrated into graduate medical education, residency programs are exploring AI's role in application screening. Personal statements (PSs) remain a highly subjective yet influential component of the residency application. This study assesses the feasibility of using a large language model (LLM) to evaluate general surgery residency PSs by comparing the model's scores against those assigned by human assessors.

Design, setting, and participants: We conducted a retrospective analysis of 668 deidentified PSs submitted to our general surgery residency program during the 2023-2024 application cycle. PSs were originally scored by human assessors (HA) using an anchored 1-5 scale in two domains: leadership and pathway. Each PS was subsequently scored by GPT-3.5 (AI) using the same rubric and standardized prompts. Descriptive statistics were used to compare AI and HA scores. Inter-rater agreement was assessed using weighted kappa coefficients. Discrepant cases (score differences >2 points) were reviewed qualitatively to identify scoring themes.
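To make the agreement analysis concrete, the sketch below reproduces the study's statistical steps on toy data: weighted kappa between AI and human-assessor (HA) scores, median/IQR descriptive statistics, and flagging of discrepant cases for qualitative review. The score values, variable names, and the linear kappa weighting are illustrative assumptions, not details reported in the paper.

```python
# Minimal sketch of the agreement analysis, assuming hypothetical data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical HA and AI scores on the anchored 1-5 scale for one
# domain (e.g., leadership); in the study, AI scores came from GPT-3.5
# prompted with the same rubric used by the human assessors.
ha_scores = np.array([4, 5, 3, 4, 2, 5, 3, 4])
ai_scores = np.array([3, 4, 3, 2, 2, 4, 3, 3])

# Weighted kappa penalizes larger disagreements more heavily; the paper
# does not state the weighting scheme, so 'linear' is assumed here.
kappa = cohen_kappa_score(ha_scores, ai_scores, weights="linear")
print(f"weighted kappa = {kappa:.3f}")

# Descriptive statistics per rater: median and interquartile range.
for name, scores in (("HA", ha_scores), ("AI", ai_scores)):
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    print(f"{name}: median {med:.0f} [IQR {q1:.0f}-{q3:.0f}]")

# Flag discrepant cases (absolute score difference > 2 points)
# for qualitative review, mirroring the study design.
discrepant = np.where(np.abs(ha_scores - ai_scores) > 2)[0]
print("discrepant case indices:", discrepant)
```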

Results: AI and HA scoring showed low agreement: κ = 0.184 for leadership and κ = 0.120 for pathway domains. Median AI leadership scores were lower (3 [IQR 2-4]) than HA scores (4 [IQR 3-5]), while AI pathway scores were higher (4 [IQR 4-5]) than HA scores (3 [IQR 3-4]). Qualitative review revealed that AI required explicit labeling (e.g., formal leadership titles or stated adversity) to assign higher scores, whereas HA rewarded inferred qualities such as resilience, passion, and longitudinal commitment.

Conclusions: AI applied rubric-based scoring consistently but interpreted narrative content differently than human reviewers. While AI may enhance consistency and scalability in early application screening, its limitations in recognizing implicit meaning suggest human judgment remains essential for evaluating nuanced or inferential content. Caution should be exercised in adopting AI tools for subjective application review.
