Anastasia Dermata, Aristidis Arhakis, Miltiadis A Makrygiannakis, Kostis Giannakopoulos, Eleftherios G Kaklamanos
Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence.
Purpose: The use of large language models (LLMs) in generative artificial intelligence (AI) is rapidly increasing in dentistry. However, their reliability has yet to be fully established. This study aims to evaluate the diagnostic accuracy, clinical applicability, and patient education potential of LLMs in paediatric dentistry by assessing the responses of six LLMs: Google AI's Gemini and Gemini Advanced, OpenAI's ChatGPT-3.5, -4o and -4, and Microsoft's Copilot.
Methods: Ten open-ended clinical questions relevant to paediatric dentistry were posed to the LLMs. The responses were graded by two independent evaluators from 0 to 10 using a detailed rubric. After 4 weeks, the answers were re-evaluated to assess intra-evaluator reliability. Statistical comparisons used Friedman's, Wilcoxon's, and Kruskal-Wallis tests to identify the model that provided the most comprehensive, accurate, explicit and relevant answers.
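The comparison described above can be sketched in a few lines with SciPy. This is a minimal illustration, not the study's code: the score values below are invented, and only one pairwise Wilcoxon comparison is shown.

```python
# Illustrative sketch of the statistical comparison described in Methods:
# Friedman's test across the six models' per-question rubric scores, then a
# Wilcoxon signed-rank test between the best- and worst-scoring models.
# All score values here are hypothetical, not the study's data.
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical rubric scores (0-10) for the ten questions, one list per model.
scores = {
    "ChatGPT-4":       [8.5, 8.0, 7.5, 8.0, 9.0, 8.5, 7.0, 8.0, 8.5, 8.0],
    "Gemini Advanced": [8.0, 8.5, 7.0, 8.5, 8.5, 8.0, 7.5, 8.0, 8.5, 8.0],
    "ChatGPT-4o":      [8.0, 8.0, 7.5, 8.0, 8.5, 8.0, 7.0, 8.0, 8.0, 8.0],
    "ChatGPT-3.5":     [7.5, 7.5, 7.0, 7.5, 8.0, 7.5, 7.0, 7.5, 8.0, 7.5],
    "Gemini":          [7.5, 7.0, 6.5, 7.5, 7.5, 7.0, 7.0, 7.5, 7.5, 7.0],
    "Copilot":         [5.5, 5.0, 5.5, 5.0, 6.0, 5.5, 5.0, 5.5, 5.5, 5.5],
}

# Friedman's test: do the six related score distributions differ overall?
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Wilcoxon signed-rank test on one pair of models as an example.
w, pw = wilcoxon(scores["ChatGPT-4"], scores["Copilot"])
print(f"ChatGPT-4 vs Copilot: W = {w:.1f}, p = {pw:.4f}")
```

In practice, pairwise Wilcoxon tests over multiple model pairs would also need a multiple-comparison correction (e.g. Bonferroni), which the sketch omits.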
Results: Variation across the models was noted. ChatGPT-4's answers scored highest (average score 8.08), followed by those of Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32) and Copilot (5.41). Statistical analysis revealed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots achieved a score above 6.5 on all queries.
Conclusion: This study demonstrates the potential of large language models (LLMs) to support evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should use AI models critically, as supportive tools rather than as a substitute for comprehensive scientific knowledge and critical thinking.
Journal overview:
The aim and scope of European Archives of Paediatric Dentistry (EAPD) is to promote research in all aspects of dentistry for children, including interceptive orthodontics and studies on children and young adults with special needs. The EAPD focuses on the publication and critical evaluation of clinical and basic science research related to children. The EAPD will consider clinical case series reports, accompanied by a relevant literature review, only where there are new and important findings of interest to paediatric dentistry and where details of the techniques or treatment carried out, and the success of such approaches, are given.