Assessment of the Large Language Models in Creating Dental Board-Style Questions: A Prospective Cross-Sectional Study.

IF 1.9 | CAS Tier 4 (Education) | JCR Q3 | DENTISTRY, ORAL SURGERY & MEDICINE
Nguyen Viet Anh, Nguyen Thi Trang
{"title":"评估大型语言模型在创建牙科板式问题:一项前瞻性横断面研究。","authors":"Nguyen Viet Anh, Nguyen Thi Trang","doi":"10.1111/eje.70015","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five of the LLMs in generating dental board-style questions.</p><p><strong>Materials and methods: </strong>This prospective cross-sectional study evaluated five of the advanced LLMs as of August 2024, including ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators based on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.</p><p><strong>Result: </strong>Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed other models in providing rationales for answers (p < 0.01).</p><p><strong>Conclusion: </strong>LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.</p>","PeriodicalId":50488,"journal":{"name":"European Journal of Dental Education","volume":" ","pages":""},"PeriodicalIF":1.9000,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessment of the Large Language Models in Creating Dental Board-Style Questions: A Prospective Cross-Sectional Study.\",\"authors\":\"Nguyen Viet Anh, Nguyen Thi Trang\",\"doi\":\"10.1111/eje.70015\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five of the LLMs in generating dental board-style questions.</p><p><strong>Materials and methods: </strong>This prospective cross-sectional study evaluated five of the advanced LLMs as of August 2024, including ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). 
Each question was independently evaluated by two investigators based on five criteria: clarity, relevance, suitability, distractor and rationale, using a 10-point Likert scale.</p><p><strong>Result: </strong>Inter-rater reliability was substantial (kappa score: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 across all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than other criteria (p < 0.05). No significant difference was observed between models regarding clarity, relevance and suitability (p > 0.05). Claude 3.5 Sonnet outperformed other models in providing rationales for answers (p < 0.01).</p><p><strong>Conclusion: </strong>LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.</p>\",\"PeriodicalId\":50488,\"journal\":{\"name\":\"European Journal of Dental Education\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-07-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"European Journal of Dental Education\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1111/eje.70015\",\"RegionNum\":4,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Dental Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1111/eje.70015","RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Introduction: Although some studies have investigated the application of large language models (LLMs) in generating dental-related multiple-choice questions (MCQs), they have primarily focused on ChatGPT and Gemini. This study aims to evaluate and compare the performance of five LLMs in generating dental board-style questions.

Materials and methods: This prospective cross-sectional study evaluated five advanced LLMs available as of August 2024: ChatGPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Copilot Pro (Microsoft), Gemini 1.5 Pro (Google) and Mistral Large 2 (Mistral AI). The five most recent clinical guidelines published by The Journal of the American Dental Association were used to generate a total of 350 questions (70 questions per LLM). Each question was independently evaluated by two investigators on five criteria (clarity, relevance, suitability, distractor and rationale) using a 10-point Likert scale.
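
The paper does not reproduce its prompts, and several of the evaluated products (e.g., Copilot Pro) are consumer applications rather than APIs, so the following is purely an illustrative sketch of guideline-to-MCQ generation against one API-accessible model, assuming the OpenAI Python client (v1.x); the prompt wording, file name and helper function are hypothetical.

```python
# Hypothetical sketch: generate one board-style MCQ from a guideline
# excerpt. The study's actual prompts are not reported.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are a dental board examiner.
Based only on the clinical guideline excerpt below, write one
board-style multiple-choice question with four options (A-D),
exactly one correct answer, three plausible distractors, and a
short rationale for the correct answer.

Guideline excerpt:
{excerpt}
"""

def generate_mcq(excerpt: str, model: str = "gpt-4o") -> str:
    """Return one board-style MCQ generated from a guideline excerpt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(excerpt=excerpt)}],
    )
    return response.choices[0].message.content

# Usage (the study generated 70 questions per model), e.g.:
# mcq = generate_mcq(open("ada_guideline_2024.txt").read())
```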

Results: Inter-rater reliability was substantial (kappa: 0.7-0.8). Median scores for clarity, relevance and rationale were above 9 for all five LLMs. Suitability and distractor had median scores ranging from 8 to 9. Within each LLM, clarity and relevance scored higher than the other criteria (p < 0.05). No significant difference was observed between models in clarity, relevance or suitability (p > 0.05). Claude 3.5 Sonnet outperformed the other models in providing rationales for answers (p < 0.01).
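
As a numerical illustration of the statistics reported above, here is a minimal sketch with hypothetical (randomly generated) scores. It assumes quadratically weighted Cohen's kappa for the ordinal 10-point scale and a Kruskal-Wallis test for the between-model comparison; the paper does not name its exact tests, so these are common stand-ins for Likert-type data.

```python
# Hypothetical data only: per-question Likert scores (1-10) as
# integer arrays, 70 questions per model as in the study.
import numpy as np
from scipy.stats import kruskal
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
rater1 = rng.integers(8, 11, size=70)                     # rater A, one model
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=70), 1, 10)  # rater B

# Weighted kappa respects the ordinal nature of a 10-point scale.
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"inter-rater kappa: {kappa:.2f}")  # study reported 0.7-0.8

# Clarity scores for all five models, then a between-model comparison.
scores_by_model = [rng.integers(8, 11, size=70) for _ in range(5)]
print("median scores:", [float(np.median(s)) for s in scores_by_model])

stat, p = kruskal(*scores_by_model)
print(f"Kruskal-Wallis p = {p:.3f}")  # p > 0.05 -> no significant difference
```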

Conclusion: LLMs demonstrate strong capabilities in generating high-quality, clinically relevant dental board-style questions. Among them, Claude 3.5 Sonnet exhibited the best performance in providing rationales for answers.

Source journal: European Journal of Dental Education
CiteScore: 4.10
Self-citation rate: 16.70%
Articles per year: 127
Review time: 6-12 weeks
Journal description: The aim of the European Journal of Dental Education is to publish original, topical and review articles of the highest quality in the field of dental education. The Journal seeks to disseminate widely the latest information on curriculum development, teaching methodologies, assessment techniques and quality assurance in the fields of dental undergraduate and postgraduate education and dental auxiliary personnel training. The scope includes the dental educational aspects of the basic medical sciences, the behavioural sciences, the interface with medical education, information technology and distance learning, and educational audit. Papers embodying the results of high-quality educational research of relevance to dentistry are particularly encouraged, as are evidence-based reports of novel and established educational programmes and their outcomes.