Evaluating and leveraging large language models in clinical pharmacology and therapeutics assessment: From exam takers to exam shapers.

IF 3.1 · CAS Tier 3 (Medicine) · Q2 PHARMACOLOGY & PHARMACY
Alexandre O Gérard, Diane Merino, Marc Labriffe, Fanny Rocher, Delphine Viard, Laurence Zemori, Thibaud Lavrut, Erik M Donker, Joost D Piët, Jean-Paul Fournier, Milou-Daniel Drici, Alexandre Destere
{"title":"在临床药理学和治疗学评估中评估和利用大型语言模型:从考生到考试制定者。","authors":"Alexandre O Gérard, Diane Merino, Marc Labriffe, Fanny Rocher, Delphine Viard, Laurence Zemori, Thibaud Lavrut, Erik M Donker, Joost D Piët, Jean-Paul Fournier, Milou-Daniel Drici, Alexandre Destere","doi":"10.1002/bcp.70137","DOIUrl":null,"url":null,"abstract":"<p><strong>Aims: </strong>In medical education, the ability of large language models (LLMs) to match human performance raises questions about their potential as educational tools. This study evaluates LLMs' performance on Clinical Pharmacology and Therapeutics (CPT) exams, comparing their results to medical students and exploring their ability to identify poorly formulated multiple-choice questions (MCQs).</p><p><strong>Methods: </strong>ChatGPT-4 Omni, Gemini Advanced, Le Chat and DeepSeek R1 were tested on local CPT exams (third year of bachelor's degree, first/second year of master's degree) and the European Prescribing Exam (EuroPE<sup>+</sup>). The exams included MCQs and open-ended questions assessing knowledge and prescribing skills. LLM results were analysed using the same scoring system as students. A confusion matrix was used to evaluate the ability of ChatGPT and Gemini to identify ambiguous/erroneous MCQs.</p><p><strong>Results: </strong>LLMs achieved comparable or superior results to medical students across all levels. For local exams, LLMs outperformed M1 students and matched L3 and M2 students. In EuroPE<sup>+</sup>, LLMs significantly outperformed students in both the knowledge and prescribing skills sections. All LLM errors in EuroPE<sup>+</sup> were genuine (100%), whereas local exam errors were frequently due to ambiguities or correction flaws (24.3%). When both ChatGPT and Gemini provided the same incorrect answer to an MCQ, the specificity for detecting ambiguous questions was 92.9%, with a negative predictive value of 85.5%.</p><p><strong>Conclusion: </strong>LLMs demonstrate capabilities comparable to or exceeding medical students in CPT exams. Their ability to flag potentially flawed MCQs highlights their value not only as educational tools but also as quality control instruments in exam preparation.</p>","PeriodicalId":9251,"journal":{"name":"British journal of clinical pharmacology","volume":" ","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating and leveraging large language models in clinical pharmacology and therapeutics assessment: From exam takers to exam shapers.\",\"authors\":\"Alexandre O Gérard, Diane Merino, Marc Labriffe, Fanny Rocher, Delphine Viard, Laurence Zemori, Thibaud Lavrut, Erik M Donker, Joost D Piët, Jean-Paul Fournier, Milou-Daniel Drici, Alexandre Destere\",\"doi\":\"10.1002/bcp.70137\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Aims: </strong>In medical education, the ability of large language models (LLMs) to match human performance raises questions about their potential as educational tools. 
This study evaluates LLMs' performance on Clinical Pharmacology and Therapeutics (CPT) exams, comparing their results to medical students and exploring their ability to identify poorly formulated multiple-choice questions (MCQs).</p><p><strong>Methods: </strong>ChatGPT-4 Omni, Gemini Advanced, Le Chat and DeepSeek R1 were tested on local CPT exams (third year of bachelor's degree, first/second year of master's degree) and the European Prescribing Exam (EuroPE<sup>+</sup>). The exams included MCQs and open-ended questions assessing knowledge and prescribing skills. LLM results were analysed using the same scoring system as students. A confusion matrix was used to evaluate the ability of ChatGPT and Gemini to identify ambiguous/erroneous MCQs.</p><p><strong>Results: </strong>LLMs achieved comparable or superior results to medical students across all levels. For local exams, LLMs outperformed M1 students and matched L3 and M2 students. In EuroPE<sup>+</sup>, LLMs significantly outperformed students in both the knowledge and prescribing skills sections. All LLM errors in EuroPE<sup>+</sup> were genuine (100%), whereas local exam errors were frequently due to ambiguities or correction flaws (24.3%). When both ChatGPT and Gemini provided the same incorrect answer to an MCQ, the specificity for detecting ambiguous questions was 92.9%, with a negative predictive value of 85.5%.</p><p><strong>Conclusion: </strong>LLMs demonstrate capabilities comparable to or exceeding medical students in CPT exams. Their ability to flag potentially flawed MCQs highlights their value not only as educational tools but also as quality control instruments in exam preparation.</p>\",\"PeriodicalId\":9251,\"journal\":{\"name\":\"British journal of clinical pharmacology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"British journal of clinical pharmacology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1002/bcp.70137\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"PHARMACOLOGY & PHARMACY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"British journal of clinical pharmacology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/bcp.70137","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0

Abstract

Evaluating and leveraging large language models in clinical pharmacology and therapeutics assessment: From exam takers to exam shapers.

Aims: In medical education, the ability of large language models (LLMs) to match human performance raises questions about their potential as educational tools. This study evaluates LLMs' performance on Clinical Pharmacology and Therapeutics (CPT) exams, comparing their results with those of medical students and exploring their ability to identify poorly formulated multiple-choice questions (MCQs).

Methods: ChatGPT-4 Omni, Gemini Advanced, Le Chat and DeepSeek R1 were tested on local CPT exams (third year of bachelor's degree, first/second year of master's degree) and the European Prescribing Exam (EuroPE+). The exams included MCQs and open-ended questions assessing knowledge and prescribing skills. LLM results were analysed using the same scoring system as students. A confusion matrix was used to evaluate the ability of ChatGPT and Gemini to identify ambiguous/erroneous MCQs.

Results: LLMs achieved comparable or superior results to medical students across all levels. For local exams, LLMs outperformed M1 students and matched L3 and M2 students. In EuroPE+, LLMs significantly outperformed students in both the knowledge and prescribing skills sections. All LLM errors in EuroPE+ were genuine (100%), whereas local exam errors were frequently due to ambiguities or correction flaws (24.3%). When both ChatGPT and Gemini provided the same incorrect answer to an MCQ, the specificity for detecting ambiguous questions was 92.9%, with a negative predictive value of 85.5%.
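
The specificity and negative predictive value reported here follow from a 2×2 confusion matrix in which the "flag" is ChatGPT and Gemini giving the same incorrect answer and the "condition" is the MCQ being judged ambiguous or erroneous on review. The sketch below shows how those two metrics are computed; the function name and the counts are illustrative assumptions, not data from the study.

```python
# Minimal sketch of the confusion-matrix metrics described above.
# "Flagged"  = ChatGPT and Gemini gave the same incorrect answer to an MCQ.
# "Ambiguous" = the MCQ was judged ambiguous/erroneous by the examiners.
# All counts below are hypothetical, for illustration only.

def specificity_and_npv(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (specificity, negative predictive value) for a 2x2 confusion matrix.

    tp: flagged and truly ambiguous      fp: flagged but well-formed
    fn: not flagged but ambiguous        tn: not flagged and well-formed
    """
    specificity = tn / (tn + fp)  # well-formed MCQs correctly left unflagged
    npv = tn / (tn + fn)          # unflagged MCQs that are indeed well-formed
    return specificity, npv


if __name__ == "__main__":
    # Hypothetical counts, not the study's data.
    spec, npv = specificity_and_npv(tp=8, fp=3, fn=5, tn=40)
    print(f"Specificity: {spec:.1%}, NPV: {npv:.1%}")
```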

Conclusion: LLMs demonstrate capabilities comparable to or exceeding medical students in CPT exams. Their ability to flag potentially flawed MCQs highlights their value not only as educational tools but also as quality control instruments in exam preparation.

Source journal
CiteScore: 6.30
Self-citation rate: 8.80%
Articles per year: 419
Review time: 1 month
Journal description: Published on behalf of the British Pharmacological Society, the British Journal of Clinical Pharmacology features papers and reports on all aspects of drug action in humans: review articles, mini review articles, original papers, commentaries, editorials and letters. The Journal enjoys a wide readership, bridging the gap between the medical profession, clinical research and the pharmaceutical industry. It also publishes research on new methods, new drugs and new approaches to treatment. The Journal is recognised as one of the leading publications in its field. It is online only, publishes open access research through its OnlineOpen programme and is published monthly.