Irene Ma, Mike Paget, Janeve Desy, Adrian Harvey, Glenda Bendiak, Christopher Naugler, Kevin McLaughlin
Medical Education, pp. 685-692. DOI: 10.1111/medu.70101. Published 2026-06-01 (Epub 2025-11-25). Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13129623/pdf/
A validity evaluation of lexicon-based sentiment analysis of medical students' clinical performance from in-training evaluation reports.
Introduction: Assessment of clinical performance has traditionally been a numbers game based upon Likert scale ratings. However, advances in natural language processing (NLP) now make it possible to incorporate rich narrative data into assessment. In this study, our objective was to evaluate the validity of lexicon-based sentiment analysis of medical students' clinical performance from in-training evaluation reports (ITERs), with a view to fully integrating this as a machine-based process into future assessment decisions.
Methods: This was a mixed-methods, retrospective derivation/validation cohort study structured around Kane's validity framework. We used content analysis to create a lexicon of performance descriptors, performed a G-study, and calculated the positive likelihood ratio (LR+) for descriptors (scoring). To evaluate generalisation, we calculated the intraclass correlation coefficient (ICC) and compared descriptors in the derivation and validation cohorts. We then performed human (manual) lexicon-based sentiment analysis and compared the number of descriptors of each type between cohorts of the highest performing students (HPS) and lowest performing students (LPS).
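At its core, lexicon-based sentiment analysis of an ITER narrative reduces to looking up each word in a curated lexicon and tallying descriptors by polarity. The sketch below illustrates the idea only; the lexicon entries shown are hypothetical examples, not the study's actual performance-descriptor lexicon.

```python
import re

# Hypothetical lexicon mapping performance descriptors to a sentiment class.
# The study derived its lexicon via content analysis of real ITERs.
LEXICON = {
    "excellent": "positive",
    "thorough": "positive",
    "adequate": "neutral",
    "disorganised": "negative",
    "unreliable": "negative",
}

def count_descriptors(narrative: str) -> dict:
    """Tally positive, neutral and negative descriptors found in one report."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for token in re.findall(r"[a-z]+", narrative.lower()):
        if token in LEXICON:
            counts[LEXICON[token]] += 1
    return counts

print(count_descriptors("An excellent, thorough student; notes were adequate."))
# → {'positive': 2, 'neutral': 1, 'negative': 0}
```

Per-student counts of each descriptor type, aggregated across reports, are then what get compared between cohorts.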
Results: In our G-study, 86.6% of variance was attributed to the student. The ICC between raters for identification of descriptors was 0.93. The mean number of neutral descriptors was similar between the HPS and LPS cohorts, but the mean (SD) number of negative descriptors was higher for LPS (11.4 (10.8) versus 1.4 (1.6) for HPS, p < 0.01, d = 1.37) and the mean (SD) number of positive descriptors was higher for HPS (19 (14) versus 1.4 (1.5) for LPS, p < 0.001, d = 1.86).
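The effect sizes quoted above are Cohen's d values. As a rough check, a simplified pooled-SD formula that assumes equal group sizes can be applied to the reported means and SDs; the result lands near (though not exactly on) the published d, since the study's group sizes and exact formula may differ.

```python
import math

def cohens_d(m1: float, s1: float, m2: float, s2: float) -> float:
    """Cohen's d using a pooled SD that assumes equal group sizes
    (a simplification; the study's exact computation may differ)."""
    pooled_sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Negative descriptors: LPS mean 11.4 (SD 10.8) vs HPS mean 1.4 (SD 1.6)
print(round(cohens_d(11.4, 10.8, 1.4, 1.6), 2))
# → 1.3 (close to the reported d = 1.37; the gap reflects the equal-n assumption)
```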
Discussion: In the midst of their busy clinical work schedule, preceptors find time to tell a story about a medical student and these narrative data enrich the assessment portfolio. Based upon our validity argument, we feel there is a role for lexicon-based sentiment analysis of clinical performance descriptors in ITERs and that these results can contribute meaningfully to assessment decisions.
Journal introduction:
Medical Education seeks to be the pre-eminent journal in the field of education for health care professionals, and publishes material of the highest quality, reflecting worldwide or provocative issues and perspectives.
The journal welcomes high quality papers on all aspects of health professional education, including:
- undergraduate education
- postgraduate training
- continuing professional development
- interprofessional education