Nikola Košćica, Colleen Gillespie, Tyler Webster, Suparna Sarkar, Michael Poles
{"title":"Using a Large Language Model to Extract Information from Student Submitted Free-Text Feedback.","authors":"Nikola Košćica, Colleen Gillespie, Tyler Webster, Suparna Sarkar, Michael Poles","doi":"10.1007/s40670-025-02615-1","DOIUrl":null,"url":null,"abstract":"<p><p>Student feedback is essential to curriculum evaluation. While methods for analyzing quantitative feedback data are readily available and easy to implement, methods for analyzing text-based, qualitative feedback data are less widely available, requiring more time, effort, and expertise. And yet, students' responses to open-ended questions hold great value for curriculum refinement because narrative comments can identify un-anticipated areas of concern that closed-ended rating scales might miss and often provide specific suggestions for improvement. In this paper, we describe efforts to analyze the feasibility and accuracy of using a Large Language Model (ChatGPT 4o) to analyze medical student comments in response to a question asking them to identify basic science topics they found challenging. ChatGPT 4o was used to categorize and summarize students' identification of and explanations for these challenging topics. We describe the specific prompts used to generate and refine results and then conducted a series of experiments to explore consistency, accuracy, and meaningfulness: (1) reviewing the consistency of 10 replications of the ChatGPT 4o request; (2) comparing \"expert\" human ratings of topic categories with ChatGPT's categorization; and (3) comparing \"expert\" human analyses of the explanations for a challenging topic with those generated by ChatGPT. Overall, we found the LLM output to be useful, fairly closely aligned with human experts, and easy to implement. However, results were not perfectly replicated across multiple trials and we found some differences between human and LLM analyses. Our use case is well suited to the current capabilities of genAI models in that summaries can be rapidly and easily generated with sufficient (but not perfect) consistency and accuracy to support continuous quality improvement of basic science curriculum.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s40670-025-02615-1.</p>","PeriodicalId":37113,"journal":{"name":"Medical Science Educator","volume":"36 1","pages":"63-72"},"PeriodicalIF":1.8000,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13043980/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Science Educator","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s40670-025-02615-1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/2/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Student feedback is essential to curriculum evaluation. While methods for analyzing quantitative feedback data are readily available and easy to implement, methods for analyzing text-based, qualitative feedback data are less widely available and require more time, effort, and expertise. Yet students' responses to open-ended questions hold great value for curriculum refinement: narrative comments can surface unanticipated areas of concern that closed-ended rating scales might miss and often provide specific suggestions for improvement. In this paper, we assess the feasibility and accuracy of using a Large Language Model (ChatGPT 4o) to analyze medical student comments in response to a question asking them to identify basic science topics they found challenging. ChatGPT 4o was used to categorize and summarize students' identification of, and explanations for, these challenging topics. We describe the specific prompts used to generate and refine results, and then report a series of experiments exploring consistency, accuracy, and meaningfulness: (1) reviewing the consistency of 10 replications of the ChatGPT 4o request; (2) comparing "expert" human ratings of topic categories with ChatGPT's categorization; and (3) comparing "expert" human analyses of the explanations for a challenging topic with those generated by ChatGPT. Overall, we found the LLM output useful, fairly closely aligned with human experts, and easy to produce. However, results were not perfectly replicated across multiple trials, and we found some differences between the human and LLM analyses. Our use case is well suited to the current capabilities of genAI models in that summaries can be generated rapidly and easily, with sufficient (but not perfect) consistency and accuracy to support continuous quality improvement of the basic science curriculum.
Supplementary information: The online version contains supplementary material available at 10.1007/s40670-025-02615-1.
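The paper's workflow (LLM categorization of comments, replication checks, and comparison against "expert" human ratings) can be approximated programmatically. Below is a minimal sketch, not the authors' actual prompts or pipeline: the paper used the ChatGPT 4o interface, while this sketch calls the equivalent model through the OpenAI API. The category list, comments, and human labels are hypothetical placeholders, and Cohen's kappa is one common choice for quantifying the human-LLM agreement the study examined.

```python
# Sketch, assuming an OPENAI_API_KEY in the environment and the
# openai + scikit-learn packages installed. Categories, comments,
# and "expert" labels below are illustrative, not from the paper.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

CATEGORIES = ["Biochemistry", "Physiology", "Pharmacology",
              "Immunology", "Other"]  # hypothetical topic set

def categorize(comment: str) -> str:
    """Ask the model to assign exactly one topic category to a comment."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # damps run-to-run variation; the paper found
                        # replications were consistent but not identical
        messages=[
            {"role": "system",
             "content": ("You are classifying medical students' free-text "
                         "feedback about challenging basic science topics. "
                         "Reply with exactly one category from: "
                         + ", ".join(CATEGORIES))},
            {"role": "user", "content": comment},
        ],
    )
    return resp.choices[0].message.content.strip()

comments = [
    "The renin-angiotensin lectures moved too fast to follow.",
    "I still don't really understand enzyme kinetics.",
]
human_labels = ["Physiology", "Biochemistry"]  # placeholder expert ratings

llm_labels = [categorize(c) for c in comments]
print("LLM labels:", llm_labels)
# Agreement with human raters, analogous to the paper's expert comparison:
print("Cohen's kappa vs. humans:",
      cohen_kappa_score(human_labels, llm_labels))
```

Repeating the classification loop 10 times and tallying label agreement across runs would mirror the paper's replication experiment; temperature 0 makes disagreements rarer but, as the authors observed, does not guarantee identical output.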
Journal overview:
Medical Science Educator is the successor to the journal JIAMSE and the peer-reviewed publication of the International Association of Medical Science Educators (IAMSE). The Journal offers all who teach in healthcare current information to succeed in that task by publishing scholarly activities, opinions, and resources in medical science education. Published articles focus on teaching the sciences fundamental to modern medicine and health, and include basic science education, clinical teaching, and the use of modern educational technologies. The Journal provides its readership with a better understanding of teaching and learning techniques in order to advance medical science education.