Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence.

Impact Factor 2.3 | Q2 Dentistry, Oral Surgery & Medicine
Anastasia Dermata, Aristidis Arhakis, Miltiadis A Makrygiannakis, Kostis Giannakopoulos, Eleftherios G Kaklamanos
DOI: 10.1007/s40368-025-01012-x
Journal: European Archives of Paediatric Dentistry, pages 527-535
Published: 2025-06-01 (Epub 2025-02-22)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12165978/pdf/
Citations: 0

Abstract

Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence.

Purpose: The use of large language models (LLMs) in generative artificial intelligence (AI) is increasing rapidly in dentistry. However, their reliability has yet to be fully established. This study evaluates the diagnostic accuracy, clinical applicability, and patient-education potential of LLMs in paediatric dentistry by assessing the responses of six models: Google AI's Gemini and Gemini Advanced, OpenAI's ChatGPT-3.5, -4 and -4o, and Microsoft's Copilot.

Methods: Ten open-ended clinical questions relevant to paediatric dentistry were posed to the LLMs. The responses were graded from 0 to 10 by two independent evaluators using a detailed rubric. After 4 weeks, the answers were re-evaluated to assess intra-evaluator reliability. Statistical comparisons used Friedman's, Wilcoxon's, and Kruskal-Wallis tests to identify the model that provided the most comprehensive, accurate, explicit, and relevant answers.
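The statistical workflow described above (a Friedman omnibus test across models on the same set of questions, followed by pairwise Wilcoxon signed-rank comparisons) can be sketched as follows. This is a minimal illustration, not the authors' actual analysis code, and all scores below are hypothetical placeholders rather than the study's data:

```python
# Sketch of the described analysis: Friedman test across related samples
# (same 10 questions, different models), then pairwise Wilcoxon tests.
# Scores are hypothetical rubric grades (0-10), NOT the study's data.
from scipy import stats

scores = {
    "ChatGPT-4":  [8, 9, 8, 7, 9, 8, 8, 8, 9, 7],
    "Gemini-Adv": [8, 8, 9, 7, 8, 8, 7, 9, 8, 8],
    "Copilot":    [5, 6, 4, 5, 6, 5, 6, 5, 5, 6],
}

# Friedman omnibus test: do the score distributions differ across models?
stat, p = stats.friedmanchisquare(*scores.values())
print(f"Friedman chi2={stat:.2f}, p={p:.4f}")

# If significant, follow up with paired, non-parametric pairwise comparisons.
names = list(scores)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        w, pw = stats.wilcoxon(scores[names[i]], scores[names[j]])
        print(f"{names[i]} vs {names[j]}: W={w:.1f}, p={pw:.4f}")
```

The Friedman test is the appropriate omnibus choice here because each model answers the same ten questions, making the samples related; the Wilcoxon signed-rank test then localises which model pairs differ.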

Results: Scores varied across models. ChatGPT-4's answers were scored best (average 8.08), followed by Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32), and Copilot (5.41). Statistical analysis revealed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots achieved a score above 6.5 on every query.

Conclusion: This study demonstrates the potential of large language models (LLMs) to support evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should use AI models critically, as supportive tools rather than as a substitute for sound scientific knowledge and critical thinking.

Source journal: European Archives of Paediatric Dentistry
CiteScore: 4.40
Self-citation rate: 9.10%
Annual article output: 81
Journal description: The aim and scope of European Archives of Paediatric Dentistry (EAPD) are to promote research in all aspects of dentistry for children, including interceptive orthodontics and studies of children and young adults with special needs. The EAPD focuses on the publication and critical evaluation of clinical and basic-science research related to children. The EAPD will consider clinical case-series reports, followed by a review of the relevant literature, only where they present new and important findings of interest to paediatric dentistry and give details of the techniques or treatments carried out and their success.