Nikola Košćica, Colleen Gillespie, Tyler Webster, Suparna Sarkar, Michael Poles
{"title":"Using a Large Language Model to Extract Information from Student Submitted Free-Text Feedback.","authors":"Nikola Košćica, Colleen Gillespie, Tyler Webster, Suparna Sarkar, Michael Poles","doi":"10.1007/s40670-025-02615-1","DOIUrl":null,"url":null,"abstract":"<p><p>Student feedback is essential to curriculum evaluation. While methods for analyzing quantitative feedback data are readily available and easy to implement, methods for analyzing text-based, qualitative feedback data are less widely available, requiring more time, effort, and expertise. And yet, students' responses to open-ended questions hold great value for curriculum refinement because narrative comments can identify un-anticipated areas of concern that closed-ended rating scales might miss and often provide specific suggestions for improvement. In this paper, we describe efforts to analyze the feasibility and accuracy of using a Large Language Model (ChatGPT 4o) to analyze medical student comments in response to a question asking them to identify basic science topics they found challenging. ChatGPT 4o was used to categorize and summarize students' identification of and explanations for these challenging topics. We describe the specific prompts used to generate and refine results and then conducted a series of experiments to explore consistency, accuracy, and meaningfulness: (1) reviewing the consistency of 10 replications of the ChatGPT 4o request; (2) comparing \"expert\" human ratings of topic categories with ChatGPT's categorization; and (3) comparing \"expert\" human analyses of the explanations for a challenging topic with those generated by ChatGPT. Overall, we found the LLM output to be useful, fairly closely aligned with human experts, and easy to implement. However, results were not perfectly replicated across multiple trials and we found some differences between human and LLM analyses. Our use case is well suited to the current capabilities of genAI models in that summaries can be rapidly and easily generated with sufficient (but not perfect) consistency and accuracy to support continuous quality improvement of basic science curriculum.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s40670-025-02615-1.</p>","PeriodicalId":37113,"journal":{"name":"Medical Science Educator","volume":"36 1","pages":"63-72"},"PeriodicalIF":1.8000,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13043980/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Science Educator","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s40670-025-02615-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/2/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Student feedback is essential to curriculum evaluation. While methods for analyzing quantitative feedback data are readily available and easy to implement, methods for analyzing text-based, qualitative feedback data are less widely available and require more time, effort, and expertise. Yet students' responses to open-ended questions hold great value for curriculum refinement: narrative comments can surface unanticipated areas of concern that closed-ended rating scales might miss and often provide specific suggestions for improvement. In this paper, we assess the feasibility and accuracy of using a Large Language Model (ChatGPT 4o) to analyze medical student comments in response to a question asking them to identify basic science topics they found challenging. ChatGPT 4o was used to categorize and summarize students' identification of, and explanations for, these challenging topics. We describe the specific prompts used to generate and refine results, and then report a series of experiments exploring consistency, accuracy, and meaningfulness: (1) reviewing the consistency of 10 replications of the ChatGPT 4o request; (2) comparing "expert" human ratings of topic categories with ChatGPT's categorization; and (3) comparing "expert" human analyses of the explanations for a challenging topic with those generated by ChatGPT. Overall, we found the LLM output useful, fairly closely aligned with human experts, and easy to produce. However, results were not perfectly replicated across multiple trials, and we found some differences between the human and LLM analyses. Our use case is well suited to the current capabilities of genAI models in that summaries can be generated rapidly and easily, with sufficient (but not perfect) consistency and accuracy to support continuous quality improvement of the basic science curriculum.
Supplementary information: The online version contains supplementary material available at 10.1007/s40670-025-02615-1.
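The paper's workflow (LLM categorization of comments, replication checks, and comparison against "expert" human ratings) can be approximated programmatically. Below is a minimal sketch, not the authors' actual prompts or pipeline: the paper used the ChatGPT 4o interface, while this sketch calls the equivalent model through the OpenAI API. The category list, comments, and human labels are hypothetical placeholders, and Cohen's kappa is one common choice for quantifying the human-LLM agreement the study examined.

```python
# Sketch, assuming an OPENAI_API_KEY in the environment and the
# openai + scikit-learn packages installed. Categories, comments,
# and "expert" labels below are illustrative, not from the paper.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

CATEGORIES = ["Biochemistry", "Physiology", "Pharmacology",
              "Immunology", "Other"]  # hypothetical topic set

def categorize(comment: str) -> str:
    """Ask the model to assign exactly one topic category to a comment."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # damps run-to-run variation; the paper found
                        # replications were consistent but not identical
        messages=[
            {"role": "system",
             "content": ("You are classifying medical students' free-text "
                         "feedback about challenging basic science topics. "
                         "Reply with exactly one category from: "
                         + ", ".join(CATEGORIES))},
            {"role": "user", "content": comment},
        ],
    )
    return resp.choices[0].message.content.strip()

comments = [
    "The renin-angiotensin lectures moved too fast to follow.",
    "I still don't really understand enzyme kinetics.",
]
human_labels = ["Physiology", "Biochemistry"]  # placeholder expert ratings

llm_labels = [categorize(c) for c in comments]
print("LLM labels:", llm_labels)
# Agreement with human raters, analogous to the paper's expert comparison:
print("Cohen's kappa vs. humans:",
      cohen_kappa_score(human_labels, llm_labels))
```

Repeating the classification loop 10 times and tallying label agreement across runs would mirror the paper's replication experiment; temperature 0 makes disagreements rarer but, as the authors observed, does not guarantee identical output.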
Journal overview:
Medical Science Educator is the successor to the journal JIAMSE and the peer-reviewed publication of the International Association of Medical Science Educators (IAMSE). The Journal offers all who teach in healthcare current information to succeed in that task by publishing scholarly activities, opinions, and resources in medical science education. Published articles focus on teaching the sciences fundamental to modern medicine and health, and include basic science education, clinical teaching, and the use of modern educational technologies. The Journal provides its readership with a better understanding of teaching and learning techniques in order to advance medical science education.