{"title":"Using large language models (LLMs) to apply analytic rubrics to score post-encounter notes.","authors":"Christopher Runyon","doi":"10.1080/0142159X.2025.2504106","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) show promise in medical education. This study examines LLMs' ability to score post-encounter notes (PNs) from Objective Structured Clinical Examinations (OSCEs) using an analytic rubric. The goal was to evaluate and refine methods for accurate, consistent scoring.</p><p><strong>Methods: </strong>Seven LLMs scored five PNs representing varying levels of performance, including an intentionally incorrect PN. An iterative experimental design tested different prompting strategies and temperature settings, a parameter controlling LLM response creativity. Scores were compared to expected rubric-based results.</p><p><strong>Results: </strong>Consistently accurate scoring required multiple rounds of prompt refinement. Simple prompting led to high variability, which improved with structured approaches and low-temperature settings. LLMs occasionally made errors calculating total scores, necessitating external calculation. The final approach yielded consistently accurate scores across all models.</p><p><strong>Conclusions: </strong>LLMs can reliably apply analytic rubrics to PNs with careful prompt engineering and process refinement. This study illustrates their potential as scalable, automated scoring tools in medical education, though further research is needed to explore their use with holistic rubrics. These findings demonstrate the utility of LLMs in assessment practices.</p>","PeriodicalId":18643,"journal":{"name":"Medical Teacher","volume":" ","pages":"1-9"},"PeriodicalIF":3.3000,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Teacher","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1080/0142159X.2025.2504106","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Large language models (LLMs) show promise in medical education. This study examines LLMs' ability to score post-encounter notes (PNs) from Objective Structured Clinical Examinations (OSCEs) using an analytic rubric. The goal was to evaluate and refine methods for accurate, consistent scoring.
Methods: Seven LLMs scored five PNs representing varying levels of performance, including an intentionally incorrect PN. An iterative experimental design tested different prompting strategies and temperature settings (temperature is a parameter that controls how variable, or "creative", LLM responses are). Scores were compared to expected rubric-based results.
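The article does not publish its prompts or tooling, so the following is a minimal sketch of the kind of setup the Methods describe: sending one PN and an analytic rubric to a chat model at a low temperature with structured instructions. The rubric text, model name, prompt wording, and use of the OpenAI Python SDK are all illustrative assumptions, not the study's actual materials.

```python
# Hypothetical sketch: scoring one post-encounter note (PN) against an analytic
# rubric at temperature 0. All identifiers and prompt text are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Item 1 (0-2): Documents chief complaint and duration.
Item 2 (0-2): Lists at least three relevant differential diagnoses.
Item 3 (0-2): Orders an appropriate initial workup."""  # hypothetical analytic rubric

def score_note(post_encounter_note: str) -> str:
    """Ask the model to score each rubric item; return its raw JSON reply."""
    response = client.chat.completions.create(
        model="gpt-4o",      # any rubric-capable chat model
        temperature=0,        # low temperature to reduce run-to-run variability
        messages=[
            {"role": "system",
             "content": ("You are a clinical skills examiner. Score the note "
                         "against each rubric item. Reply only with JSON of the "
                         'form {"item_scores": [...]} and do not report a total.')},
            {"role": "user",
             "content": f"RUBRIC:\n{RUBRIC}\n\nPOST-ENCOUNTER NOTE:\n{post_encounter_note}"},
        ],
    )
    return response.choices[0].message.content
```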
Results: Consistently accurate scoring required multiple rounds of prompt refinement. Simple prompting produced highly variable scores; variability decreased with structured prompting approaches and low-temperature settings. LLMs occasionally made errors when calculating total scores, necessitating external calculation. The final approach yielded consistently accurate scores across all models.
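Because the Results report that the models sometimes miscalculated totals, the total score would be computed outside the model from the per-item scores it returns. The snippet below is a hypothetical continuation of the sketch above, assuming the model replies with the JSON shape requested there.

```python
# Hypothetical external total-score calculation: parse the model's per-item
# scores and sum them in code rather than trusting a model-reported total.
import json

def total_from_reply(raw_reply: str) -> int:
    """Parse the model's JSON reply and sum the item scores."""
    item_scores = json.loads(raw_reply)["item_scores"]
    return sum(int(s) for s in item_scores)

# Example: a reply of '{"item_scores": [2, 1, 2]}' yields a total of 5.
```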
Conclusions: LLMs can reliably apply analytic rubrics to PNs with careful prompt engineering and process refinement. This study illustrates their potential as scalable, automated scoring tools in medical education, though further research is needed to explore their use with holistic rubrics. These findings demonstrate the utility of LLMs in assessment practices.
Journal Introduction:
Medical Teacher provides accounts of new teaching methods, guidance on structuring courses and assessing achievement, and serves as a forum for communication between medical teachers and those involved in general education. In particular, the journal recognizes the problems teachers have in keeping up-to-date with the developments in educational methods that lead to more effective teaching and learning at a time when the content of the curriculum—from medical procedures to policy changes in health care provision—is also changing. The journal features reports of innovation and research in medical education, case studies, survey articles, practical guidelines, reviews of current literature and book reviews. All articles are peer reviewed.