{"title":"Assessment of the Large Language Models in Creating Dental Board-Style Questions: A Prospective Cross-Sectional Study.","authors":"Nguyen Viet Anh, Nguyen Thi Trang","doi":"10.1111/eje.70015","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five of the LLMs in generating dental board-style questions.</p><p><strong>Materials and methods: </strong>This prospective cross-sectional study evaluated five of the advanced LLMs as of August 2024, including ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators based on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.</p><p><strong>Result: </strong>Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed other models in providing rationales for answers (p < 0.01).</p><p><strong>Conclusion: </strong>LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.70015","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Abstract
Introduction: Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five LLMs in generating dental board-style questions.
Materials and methods: This prospective cross-sectional study evaluated five advanced LLMs available as of August 2024: ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.
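The abstract does not report the prompts or tooling used to generate the questions. A minimal illustrative sketch of this kind of generation step, assuming the OpenAI Python client and a hypothetical guideline excerpt (both the model identifier and the prompt wording below are assumptions, not the study's actual protocol):

```python
# Illustrative sketch only: the study's actual prompts, parameters and tooling are
# not specified in the abstract. The model name, prompt wording and guideline
# excerpt below are assumptions for demonstration purposes.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_EXCERPT = "..."  # hypothetical excerpt from a JADA clinical guideline

prompt = (
    "Using only the clinical guideline excerpt below, write one dental board-style "
    "multiple-choice question with four answer options (A-D), exactly one correct "
    "answer, three plausible distractors, and a brief rationale for the correct "
    "answer.\n\n"
    f"Guideline excerpt:\n{GUIDELINE_EXCERPT}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed identifier for ChatGPT-4o
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```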
Results: Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than the other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed the other models in providing rationales for answers (p < 0.01).
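The abstract reports kappa values of 0.7-0.8 for agreement between the two investigators but does not state which kappa variant was used. A minimal sketch of how such agreement could be computed on 10-point Likert ratings, assuming quadratic-weighted Cohen's kappa via scikit-learn and hypothetical rating data:

```python
# Minimal sketch: inter-rater agreement between two raters' Likert scores.
# The kappa variant (quadratic-weighted) and the rating data are assumptions;
# the abstract only reports kappa values of 0.7-0.8.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 10-point Likert scores from the two investigators for the same questions
rater_1 = [9, 8, 10, 7, 9, 8, 9, 10, 8, 9]
rater_2 = [9, 8, 9, 7, 10, 8, 9, 10, 7, 9]

# Quadratic weighting is a common choice for ordinal scales such as Likert ratings
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```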
Conclusion: LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.
Journal introduction:
The aim of the European Journal of Dental Education is to publish original, topical and review articles of the highest quality in the field of dental education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.