{"title":"Assessment of the Large Language Models in Creating Dental Board-Style Questions: A Prospective Cross-Sectional Study.","authors":"Nguyen Viet Anh, Nguyen Thi Trang","doi":"10.1111/eje.70015","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five of the LLMs in generating dental board-style questions.</p><p><strong>Materials and methods: </strong>This prospective cross-sectional study evaluated five of the advanced LLMs as of August 2024, including ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators based on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.</p><p><strong>Result: </strong>Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed other models in providing rationales for answers (p < 0.01).</p><p><strong>Conclusion: </strong>LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.70015","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Abstract
Introduction: Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five LLMs in generating dental board-style questions.
Materials and methods: This prospective cross-sectional study evaluated five advanced LLMs available as of August 2024: ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.
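The abstract does not report the prompts or tooling used to generate the questions. A minimal illustrative sketch of this kind of generation step, assuming the OpenAI Python client and a hypothetical guideline excerpt (both the model identifier and the prompt wording below are assumptions, not the study's actual protocol):

```python
# Illustrative sketch only: the study's actual prompts, parameters and tooling are
# not specified in the abstract. The model name, prompt wording and guideline
# excerpt below are assumptions for demonstration purposes.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_EXCERPT = "..."  # hypothetical excerpt from a JADA clinical guideline

prompt = (
    "Using only the clinical guideline excerpt below, write one dental board-style "
    "multiple-choice question with four answer options (A-D), exactly one correct "
    "answer, three plausible distractors, and a brief rationale for the correct "
    "answer.\n\n"
    f"Guideline excerpt:\n{GUIDELINE_EXCERPT}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed identifier for ChatGPT-4o
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```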
Results: Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than the other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed the other models in providing rationales for answers (p < 0.01).
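The abstract reports kappa values of 0.7-0.8 for agreement between the two investigators but does not state which kappa variant was used. A minimal sketch of how such agreement could be computed on 10-point Likert ratings, assuming quadratic-weighted Cohen's kappa via scikit-learn and hypothetical rating data:

```python
# Minimal sketch: inter-rater agreement between two raters' Likert scores.
# The kappa variant (quadratic-weighted) and the rating data are assumptions;
# the abstract only reports kappa values of 0.7-0.8.
from sklearn.metrics import cohen_kappa_score

# Hypothetical 10-point Likert scores from the two investigators for the same questions
rater_1 = [9, 8, 10, 7, 9, 8, 9, 10, 8, 9]
rater_2 = [9, 8, 9, 7, 10, 8, 9, 10, 7, 9]

# Quadratic weighting is a common choice for ordinal scales such as Likert ratings
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```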
Conclusion: LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.
Journal introduction:
The aim of the European Journal of Dental Education is to publish original, topical and review articles of the highest quality in the field of dental education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.