Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation

Muhammed Cihan Güvel, Yavuz Selim Kıyak, Hacer Doğan Varan, Burak Sezenöz, Özlem Coşkun, Canan Uluoğlu

European Journal of Clinical Pharmacology 81(6): 875-883 (2025). DOI: 10.1007/s00228-025-03838-2
Abstract
Purpose: This study evaluated the performance of three generative AI models (ChatGPT-4o, Gemini 1.5 Advanced Pro, and Claude 3.5 Sonnet) in producing case-based rational pharmacology questions compared with expert educators.
Methods: Using one-shot prompting, 60 questions (20 per model) were generated on the topics of essential hypertension and type 2 diabetes. A multidisciplinary panel categorized the questions by usability (no revisions needed, minor or major revisions required, or unusable). Subsequently, 24 AI-generated and 8 expert-created questions were administered to 103 medical students in a real-world exam setting. Performance metrics, including correct response rate, discrimination index, and identification of nonfunctional distractors, were analyzed.
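To illustrate the one-shot prompting setup described above, the following is a minimal sketch using the OpenAI Python SDK. The prompt wording, the worked example item, and the model identifier are placeholders for illustration, not the study's actual materials; the study also used Gemini 1.5 Advanced Pro and Claude 3.5 Sonnet through their own interfaces.

```python
# Minimal one-shot prompting sketch (assumed setup, not the study's actual prompt).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The single worked example is what makes the prompt "one-shot" (illustrative text).
EXAMPLE_ITEM = """Case: A 58-year-old man with newly diagnosed essential hypertension
(blood pressure 162/98 mmHg) and no comorbidities presents for treatment.
Question: Which drug is the most rational first-line choice?
A) ... B) ... C) ... D) ... E) ...
Answer: ..."""

def generate_item(topic: str) -> str:
    """Request one case-based rational pharmacotherapy MCQ on the given topic."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[
            {"role": "system",
             "content": "You write case-based multiple-choice questions testing "
                        "rational pharmacotherapy for medical students."},
            {"role": "user",
             "content": f"Here is one example item:\n{EXAMPLE_ITEM}\n\n"
                        f"Write a new item in the same style on: {topic}"},
        ],
    )
    return response.choices[0].message.content

print(generate_item("type 2 diabetes"))
```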
Results: No statistically significant differences were found between AI-generated and expert-created questions; mean correct response rates exceeded 50% and discrimination indices were consistently at or above 0.20. Claude produced the highest proportion of error-free items (12/20), whereas ChatGPT exhibited the fewest unusable items (5/20). Expert revision required approximately one minute per AI-generated question, a substantial efficiency gain over manual question preparation. Nonetheless, 19 of the 60 AI-generated questions were deemed unusable, underscoring the necessity of expert oversight.
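For readers unfamiliar with the reported item metrics, the sketch below shows one standard way to compute them; it is not the authors' code. Difficulty is the proportion of correct responses, discrimination compares upper and lower scoring groups (the 27% split and the under-5% distractor threshold are common conventions assumed here, not stated in the abstract; the 0.20 acceptability cutoff is from the results).

```python
# Standard classroom item analysis: difficulty, discrimination, nonfunctional distractors.
from collections import Counter

def item_analysis(responses, key, total_scores, options="ABCDE"):
    """responses: each student's chosen option; total_scores: their exam totals."""
    n = len(responses)
    difficulty = sum(r == key for r in responses) / n  # correct response rate

    # Rank students by total exam score; compare the top and bottom 27% groups.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    k = max(1, round(0.27 * n))
    upper, lower = ranked[:k], ranked[-k:]
    discrimination = (sum(responses[i] == key for i in upper)
                      - sum(responses[i] == key for i in lower)) / k

    # A distractor chosen by fewer than 5% of examinees is flagged as nonfunctional.
    counts = Counter(responses)
    nonfunctional = [o for o in options
                     if o != key and counts.get(o, 0) / n < 0.05]
    return difficulty, discrimination, nonfunctional

# Toy data for 10 students with keyed answer "B"; items with a discrimination
# index >= 0.20 would meet the threshold reported in the study.
answers = list("BBABBCBBDB")
totals = [92, 88, 85, 80, 74, 70, 66, 60, 55, 40]
print(item_analysis(answers, "B", totals))  # -> (0.7, 0.0, ['E'])
```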
Conclusion: Large language models can substantially accelerate the development of high-quality assessment questions in medical education. However, expert review remains critical to address lapses in reliability and validity. A hybrid model, integrating AI-driven efficiency with rigorous expert validation, may offer an optimal approach for enhancing educational outcomes.
Journal description:
The European Journal of Clinical Pharmacology publishes original papers on all aspects of clinical pharmacology and drug therapy in humans. Manuscripts are welcomed on the following topics: therapeutic trials, pharmacokinetics/pharmacodynamics, pharmacogenetics, drug metabolism, adverse drug reactions, drug interactions, all aspects of drug development, development relating to teaching in clinical pharmacology, pharmacoepidemiology, and matters relating to the rational prescribing and safe use of drugs. Methodological contributions relevant to these topics are also welcomed.
Data from animal experiments are accepted only in the context of original human data reported in the same paper. EJCP will consider manuscripts describing the frequency of allelic variants in different populations only if this information is linked to functional data or to novel variants of interest. Highly relevant frequency differences with a major impact on drug therapy in the respective population may be submitted as a letter to the editor.
Straightforward phase I pharmacokinetic or pharmacodynamic studies as part of new drug development will only be considered for publication if the paper involves:
- a compound that is interesting and new in some basic or fundamental way, or
- methods that are original in some basic sense, or
- a highly unexpected outcome, or
- conclusions that are scientifically novel in some basic or fundamental sense.