Benchmarking ChatGPT-3.5 and OpenAI o3 Against Clinical Pharmacists: Preliminary Insights into Clinical Accuracy, Sensitivity, and Specificity in Pharmacy MCQs.

Impact Factor: 2.7 · CAS Region 4 (Medicine) · JCR Q2 · Health Care Sciences & Services
Esraa M Alsaudi, Sireen A Shilbayeh, Rana K Abu-Farha
DOI: 10.3390/healthcare13141751
Journal: Healthcare, Vol. 13, No. 14
Published: 2025-07-19
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12294861/pdf/
Citations: 0

Abstract

Objective: This proof-of-concept study aimed to evaluate and compare the clinical performance of two AI language models (ChatGPT-3.5 and OpenAI o3) in answering clinical pharmacy multiple-choice questions (MCQs), benchmarked against responses from specialist clinical pharmacists in Jordan, including academic preceptors and hospital-based clinicians. Methods: A total of 60 clinical pharmacy MCQs were developed based on current guidelines across four therapeutic areas: cardiovascular, endocrine, infectious, and respiratory diseases. Each item was reviewed by academic and clinical experts and then pilot-tested with five pharmacists to determine clarity and difficulty. Two models, GPT-3.5 and OpenAI o3, were tested using a standardized prompt for each MCQ, entered in separate sessions to avoid memory retention. Their answers were classified as true/false positives or negatives and retested after two weeks to assess reproducibility. Simultaneously, 25 licensed pharmacists (primarily from one academic institution and several hospitals in Amman) completed the same MCQs using validated references (excluding AI tools). Accuracy, sensitivity, specificity, and Cohen's Kappa were used to compare AI and human performance, with statistical analysis conducted using appropriate tests at a significance level of p ≤ 0.05. Results: OpenAI o3 achieved the highest accuracy (83.3%), sensitivity (90.0%), and specificity (70.0%), outperforming GPT-3.5 (70.0%, 77.5%, 55.0%) and pharmacists (69.7%, 77.0%, 55.0%). AI performance declined significantly with increasing question difficulty. OpenAI o3 showed the highest accuracy in the cardiovascular domain (93.3%), while GPT-3.5 performed best in infectious diseases (80.0%). Reproducibility was higher for GPT-3.5 (81.6%, κ = 0.556) than OpenAI o3 (76.7%, κ = 0.364). Over two test rounds, GPT-3.5's accuracy remained stable, whereas OpenAI o3's accuracy decreased from 83.3% to 70.0%, indicating some variability.
Conclusions: OpenAI o3 shows strong promise as a clinical decision-support tool in pharmacy, especially for low- to moderate-difficulty questions. However, inconsistencies in reproducibility and limitations in complex cases highlight the importance of cautious, supervised integration alongside human expertise.
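The reported figures follow the standard confusion-matrix definitions of accuracy, sensitivity, and specificity, plus Cohen's kappa for test-retest agreement. As an illustrative sketch only (the paper does not publish its raw counts; the example counts below are hypothetical, chosen so that a split of 40 positive-keyed and 20 negative-keyed items reproduces OpenAI o3's reported 83.3% / 90.0% / 70.0%), the calculations can be expressed as:

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Cohen's kappa for two binary raters (e.g., round 1 vs. round 2 answers)."""
    total = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / total
    # Chance agreement: product of each rater's marginal "yes"/"no" rates.
    p_yes = ((both_yes + a_only) / total) * ((both_yes + b_only) / total)
    p_no = ((b_only + both_no) / total) * ((a_only + both_no) / total)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

# Hypothetical split consistent with o3's reported values:
# 36 TP, 4 FN (sensitivity 90.0%), 14 TN, 6 FP (specificity 70.0%)
print(binary_metrics(tp=36, fp=6, tn=14, fn=4))  # (0.833..., 0.9, 0.7)
```

Under this assumed split, accuracy works out to 50/60 ≈ 83.3%, matching the headline figure; the kappa function is the generic two-rater form, not the study's own analysis code.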

Source journal: Healthcare (Medicine, Health Policy)
CiteScore: 3.50 · Self-citation rate: 7.10% · Average review time: 47 days
Journal description: Healthcare (ISSN 2227-9032) is an international, peer-reviewed, open access journal (free for readers), which publishes original theoretical and empirical work in the interdisciplinary area of all aspects of medicine and health care research. Healthcare publishes Original Research Articles, Reviews, Case Reports, Research Notes and Short Communications. We encourage researchers to publish their experimental and theoretical results in as much detail as possible. For theoretical papers, full details of proofs must be provided so that the results can be checked; for experimental papers, full experimental details must be provided so that the results can be reproduced. Additionally, electronic files or software regarding the full details of the calculations, experimental procedure, etc., can be deposited along with the publication as "Supplementary Material".