Comparison of a generative large language model to pharmacy student performance on therapeutics examinations
Christopher J. Edwards, Bernadette Cornelison, Brian L. Erstad
Currents in Pharmacy Teaching and Learning, 17(9), Article 102394, published 2025-05-22
DOI: 10.1016/j.cptl.2025.102394 | https://www.sciencedirect.com/science/article/pii/S1877129725001157
Abstract
Objective
To compare the performance of a generative language model (ChatGPT-3.5) to pharmacy students on therapeutics examinations.
Methods
Questions were drawn from two pharmacotherapeutics courses in a 4-year PharmD program and classified as case-based or non-case-based and as application or recall. Questions were entered into ChatGPT version 3.5 and the responses were scored. ChatGPT's score for each exam was calculated by dividing the number of correct responses by the total number of questions. The mean composite score for ChatGPT was calculated by summing its scores on the individual exams and dividing by the number of exams. The mean composite score for the students was calculated by dividing the sum of the mean class performance on each exam by the number of exams. A chi-square test was used to identify factors associated with incorrect responses from ChatGPT.
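For clarity, the two composite-score calculations described above can be written in notation as follows; this is a restatement of the Methods, and the symbols S_i, \bar{c}_i, and n are introduced here purely for illustration (they do not appear in the original article):

% Composite scores as described in the Methods (notation introduced for illustration only).
% S_i       : ChatGPT's score on exam i (correct responses divided by total questions)
% \bar{c}_i : mean class performance on exam i
% n         : number of exams (n = 6 in this study)
\[
  \text{ChatGPT composite} = \frac{1}{n}\sum_{i=1}^{n} S_i,
  \qquad
  \text{Student composite} = \frac{1}{n}\sum_{i=1}^{n} \bar{c}_i
\]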
Results
The mean composite score across six exams was 53 (SD = 19.2) for ChatGPT compared with 82 (SD = 4) for the pharmacy students (p = 0.0048). ChatGPT answered 51 % of questions correctly overall. It was less likely to answer application-based questions correctly than recall-based questions (44 % vs 80 %) and less likely to answer case-based questions correctly than non-case-based questions (45 % vs 74 %).
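As a minimal sketch of the chi-square analysis named in the Methods, the snippet below applies scipy.stats.chi2_contingency to a 2x2 table of correct/incorrect counts by question type. The counts are hypothetical placeholders: the abstract reports only the percentages (44 % vs 80 %), not the number of questions in each category, so this illustrates the procedure rather than reproducing the study's result.

# Sketch of the chi-square analysis described in the Methods.
# The counts below are hypothetical placeholders chosen only to illustrate the
# procedure; the abstract does not report the number of questions per category.
from scipy.stats import chi2_contingency

# Rows: application-based, recall-based; columns: correct, incorrect
observed = [
    [44, 56],   # hypothetical: 44 % of 100 application-based questions correct
    [80, 20],   # hypothetical: 80 % of 100 recall-based questions correct
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")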
Conclusion
ChatGPT scored lower than the average pharmacy student and was less likely to answer application-based and case-based questions correctly. These findings provide insight into how this technology performs on therapeutics examinations, which can inform best practices for item development and highlights its limitations.