Irene Ma, Mike Paget, Janeve Desy, Adrian Harvey, Glenda Bendiak, Christopher Naugler, Kevin McLaughlin
Medical Education, pp. 685-692. DOI: 10.1111/medu.70101. Published 2026-06-01 (Epub 2025-11-25). Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13129623/pdf/
A validity evaluation of lexicon-based sentiment analysis of medical students' clinical performance from in-training evaluation reports.
Introduction: Assessment of clinical performance has traditionally been a numbers game based upon Likert scale ratings. However, advances in natural language processing (NLP) now make it possible to incorporate rich narrative data into assessment. In this study, our objective was to evaluate the validity of lexicon-based sentiment analysis of medical students' clinical performance from in-training evaluation reports (ITERs), with a view to fully integrating this as a machine-based process into future assessment decisions.
Methods: This was a mixed-methods, retrospective derivation/validation cohort study structured around Kane's validity framework. We used content analysis to create a lexicon of performance descriptors, performed a G-study, and calculated the positive likelihood ratio (LR+) for descriptors (scoring). To evaluate generalisation, we calculated the intraclass correlation coefficient (ICC) and compared descriptors in the derivation and validation cohorts. We then performed human (manual) lexicon-based sentiment analysis and compared the number of descriptors of each type between cohorts of the highest performing students (HPS) and lowest performing students (LPS).
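At its core, lexicon-based sentiment analysis of an ITER narrative reduces to looking up each word in a curated lexicon and tallying descriptors by polarity. The sketch below illustrates the idea only; the lexicon entries shown are hypothetical examples, not the study's actual performance-descriptor lexicon.

```python
import re

# Hypothetical lexicon mapping performance descriptors to a sentiment class.
# The study derived its lexicon via content analysis of real ITERs.
LEXICON = {
    "excellent": "positive",
    "thorough": "positive",
    "adequate": "neutral",
    "disorganised": "negative",
    "unreliable": "negative",
}

def count_descriptors(narrative: str) -> dict:
    """Tally positive, neutral and negative descriptors found in one report."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for token in re.findall(r"[a-z]+", narrative.lower()):
        if token in LEXICON:
            counts[LEXICON[token]] += 1
    return counts

print(count_descriptors("An excellent, thorough student; notes were adequate."))
# → {'positive': 2, 'neutral': 1, 'negative': 0}
```

Per-student counts of each descriptor type, aggregated across reports, are then what get compared between cohorts.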
Results: In our G-study, 86.6% of variance was attributed to the student. The ICC between raters for identification of descriptors was 0.93. The mean number of neutral descriptors was similar between the HPS and LPS cohorts, but the mean (SD) number of negative descriptors was higher for LPS (11.4 (10.8) versus 1.4 (1.6) for HPS, p < 0.01, d = 1.37) and the mean (SD) number of positive descriptors was higher for HPS (19 (14) versus 1.4 (1.5) for LPS, p < 0.001, d = 1.86).
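The effect sizes quoted above are Cohen's d values. As a rough check, a simplified pooled-SD formula that assumes equal group sizes can be applied to the reported means and SDs; the result lands near (though not exactly on) the published d, since the study's group sizes and exact formula may differ.

```python
import math

def cohens_d(m1: float, s1: float, m2: float, s2: float) -> float:
    """Cohen's d using a pooled SD that assumes equal group sizes
    (a simplification; the study's exact computation may differ)."""
    pooled_sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Negative descriptors: LPS mean 11.4 (SD 10.8) vs HPS mean 1.4 (SD 1.6)
print(round(cohens_d(11.4, 10.8, 1.4, 1.6), 2))
# → 1.3 (close to the reported d = 1.37; the gap reflects the equal-n assumption)
```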
Discussion: In the midst of their busy clinical work schedule, preceptors find time to tell a story about a medical student and these narrative data enrich the assessment portfolio. Based upon our validity argument, we feel there is a role for lexicon-based sentiment analysis of clinical performance descriptors in ITERs and that these results can contribute meaningfully to assessment decisions.
Journal introduction:
Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives.
The journal welcomes high quality papers on all aspects of health professional education, including:
- undergraduate education
- postgraduate training
- continuing professional development
- interprofessional education