Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation.

IF 2.4 · CAS Zone 3 (Medicine) · JCR Q3, Pharmacology & Pharmacy
Muhammed Cihan Güvel, Yavuz Selim Kıyak, Hacer Doğan Varan, Burak Sezenöz, Özlem Coşkun, Canan Uluoğlu
{"title":"Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation.","authors":"Muhammed Cihan Güvel, Yavuz Selim Kıyak, Hacer Doğan Varan, Burak Sezenöz, Özlem Coşkun, Canan Uluoğlu","doi":"10.1007/s00228-025-03838-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study evaluated the performance of three generative AI models-ChatGPT- 4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet-in producing case-based rational pharmacology questions compared to expert educators.</p><p><strong>Methods: </strong>Using one-shot prompting, 60 questions (20 per model) addressing essential hypertension and type 2 diabetes subjects were generated. A multidisciplinary panel categorized questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were asked to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed.</p><p><strong>Results: </strong>No statistically significant differences were found between AI-generated and expert-created questions, with mean correct response rates surpassing 50% and discrimination indices consistently equal to or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revisions required approximately one minute per AI-generated question, representing a substantial efficiency gain over manual question preperation. Nonetheless, 19 out of 60 AI-generated questions were deemed unusable, highlighting the necessity of expert oversight.</p><p><strong>Conclusion: </strong>Large language models can profoundly accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiencies with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.</p>","PeriodicalId":11857,"journal":{"name":"European Journal of Clinical Pharmacology","volume":"81 6","pages":"875-883"},"PeriodicalIF":2.4000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Clinical Pharmacology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00228-025-03838-2","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/9 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions compared to expert educators.

Methods: Using one-shot prompting, 60 questions (20 per model) were generated on essential hypertension and type 2 diabetes mellitus. A multidisciplinary panel categorized the questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed.
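
For illustration only (the paper does not publish its exact prompt), "one-shot prompting" means including a single worked example item in the prompt before asking the model to write a new one. A minimal sketch using the OpenAI Python SDK, with a hypothetical example item and system message that are assumptions, not the authors' actual materials:

```python
# Illustrative sketch only -- the study's actual prompt is not published.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# One-shot prompting: a single worked example shows the model the desired
# case-based multiple-choice format before it writes a new item.
EXAMPLE_ITEM = """Case: A 58-year-old man is newly diagnosed with essential
hypertension and has no comorbidities. ...
Question: Which drug is the most rational first-line choice?
A) ... B) ... C) ... D) ... E) ...
Correct answer: C"""

def generate_question(topic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You write case-based rational pharmacotherapy "
                        "multiple-choice questions for medical students."},
            {"role": "user",
             "content": f"Example item:\n{EXAMPLE_ITEM}\n\n"
                        f"Write one new item on: {topic}"},
        ],
    )
    return response.choices[0].message.content

print(generate_question("type 2 diabetes mellitus"))
```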

Results: No statistically significant differences were found between AI-generated and expert-created questions; mean correct response rates exceeded 50%, and discrimination indices were consistently at or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT yielded the fewest unusable items (5/20). Expert revision took approximately one minute per AI-generated question, a substantial efficiency gain over manual question preparation. Nonetheless, 19 of the 60 AI-generated questions were deemed unusable, underscoring the necessity of expert oversight.
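
For context, these are standard classical item-analysis statistics: the correct response rate is the item difficulty (p), the discrimination index is commonly computed as the difference in correct-response proportions between the top and bottom 27% of examinees by total score, and a distractor is usually called nonfunctional when fewer than 5% of examinees select it. A minimal sketch with hypothetical response data; the 27% split and 5% threshold are conventional assumptions, not taken from the paper:

```python
# Classical item analysis for one multiple-choice item (illustrative only).
import numpy as np

def item_stats(scores, total_scores, choices, correct_key, options="ABCDE"):
    """scores: 1/0 correctness on this item per student;
    total_scores: each student's total exam score;
    choices: the option letter each student selected."""
    scores = np.asarray(scores)
    total_scores = np.asarray(total_scores)

    # Correct response rate (item difficulty, p).
    p = scores.mean()

    # Discrimination index: p(top 27% by total score) - p(bottom 27%).
    n = max(1, round(0.27 * len(scores)))
    order = np.argsort(total_scores)
    d = scores[order[-n:]].mean() - scores[order[:n]].mean()

    # Nonfunctional distractors: wrong options chosen by <5% of examinees.
    nfd = [o for o in options
           if o != correct_key and choices.count(o) / len(choices) < 0.05]
    return p, d, nfd

# Hypothetical responses from 10 students (the study had 103).
p, d, nfd = item_stats(
    scores=[1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    total_scores=[78, 85, 40, 90, 35, 70, 88, 45, 80, 75],
    choices=list("CCACBCCACC"),
    correct_key="C",
)
print(f"p = {p:.2f}, D = {d:.2f}, nonfunctional distractors: {nfd}")
```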

Conclusion: Large language models can substantially accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiencies with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.

Source journal: European Journal of Clinical Pharmacology
CiteScore: 5.40 · Self-citation rate: 3.40% · Articles published per year: 170 · Review time: 3-8 weeks
Journal description: The European Journal of Clinical Pharmacology publishes original papers on all aspects of clinical pharmacology and drug therapy in humans. Manuscripts are welcomed on the following topics: therapeutic trials, pharmacokinetics/pharmacodynamics, pharmacogenetics, drug metabolism, adverse drug reactions, drug interactions, all aspects of drug development, development relating to teaching in clinical pharmacology, pharmacoepidemiology, and matters relating to the rational prescribing and safe use of drugs. Methodological contributions relevant to these topics are also welcomed. Data from animal experiments are accepted only in the context of original data in man reported in the same paper. EJCP will only consider manuscripts describing the frequency of allelic variants in different populations if this information is linked to functional data or new interesting variants. Highly relevant differences in frequency with a major impact on drug therapy for the respective population may be submitted as a letter to the editor. Straightforward phase I pharmacokinetic or pharmacodynamic studies as parts of new drug development will only be considered for publication if the paper involves:
- a compound that is interesting and new in some basic or fundamental way, or
- methods that are original in some basic sense, or
- a highly unexpected outcome, or
- conclusions that are scientifically novel in some basic or fundamental sense.