Comparison of a generative large language model to pharmacy student performance on therapeutics examinations
Christopher J. Edwards, Bernadette Cornelison, Brian L. Erstad
Currents in Pharmacy Teaching and Learning, 17(9), Article 102394, published 2025-05-22
DOI: 10.1016/j.cptl.2025.102394 | https://www.sciencedirect.com/science/article/pii/S1877129725001157
Abstract
Objective
To compare the performance of a generative language model (ChatGPT-3.5) to pharmacy students on therapeutics examinations.
Methods
Questions were drawn from two pharmacotherapeutics courses in a 4-year PharmD program and classified as case-based or non-case-based and as application or recall. Questions were entered into ChatGPT version 3.5 and the responses were scored. ChatGPT's score for each exam was calculated by dividing the number of correct responses by the total number of questions. The mean composite score for ChatGPT was calculated by summing its scores on the individual exams and dividing by the number of exams. The mean composite score for the students was calculated by dividing the sum of the mean class performance on each exam by the number of exams. A chi-square test was used to identify factors associated with incorrect responses from ChatGPT.
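For clarity, the two composite-score calculations described above can be written in notation as follows; this is a restatement of the Methods, and the symbols S_i, \bar{c}_i, and n are introduced here purely for illustration (they do not appear in the original article):

% Composite scores as described in the Methods (notation introduced for illustration only).
% S_i       : ChatGPT's score on exam i (correct responses divided by total questions)
% \bar{c}_i : mean class performance on exam i
% n         : number of exams (n = 6 in this study)
\[
  \text{ChatGPT composite} = \frac{1}{n}\sum_{i=1}^{n} S_i,
  \qquad
  \text{Student composite} = \frac{1}{n}\sum_{i=1}^{n} \bar{c}_i
\]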
Results
The mean composite score across six exams was 53 (SD = 19.2) for ChatGPT compared with 82 (SD = 4) for the pharmacy students (p = 0.0048). ChatGPT answered 51 % of questions correctly overall. It was less likely to answer application-based questions correctly than recall-based questions (44 % vs 80 %) and less likely to answer case-based questions correctly than non-case-based questions (45 % vs 74 %).
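As a minimal sketch of the chi-square analysis named in the Methods, the snippet below applies scipy.stats.chi2_contingency to a 2x2 table of correct/incorrect counts by question type. The counts are hypothetical placeholders: the abstract reports only the percentages (44 % vs 80 %), not the number of questions in each category, so this illustrates the procedure rather than reproducing the study's result.

# Sketch of the chi-square analysis described in the Methods.
# The counts below are hypothetical placeholders chosen only to illustrate the
# procedure; the abstract does not report the number of questions per category.
from scipy.stats import chi2_contingency

# Rows: application-based, recall-based; columns: correct, incorrect
observed = [
    [44, 56],   # hypothetical: 44 % of 100 application-based questions correct
    [80, 20],   # hypothetical: 80 % of 100 recall-based questions correct
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")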
Conclusion
ChatGPT scored lower than the average pharmacy student and was less likely to answer application-based and case-based questions correctly. These findings provide insight into how this technology performs on therapeutics examinations, which can inform best practices for item development and highlights its limitations.