Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos
{"title":"牙周病学中的大型语言模型:评估其在临床相关问题中的表现。","authors":"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos","doi":"10.1016/j.prosdent.2024.10.020","DOIUrl":null,"url":null,"abstract":"<p><strong>Statement of problem: </strong>Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably.</p><p><strong>Purpose: </strong>The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.</p><p><strong>Material and methods: </strong>A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs.</p><p><strong>Results: </strong>The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >;.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, while significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance.</p><p><strong>Conclusions: </strong>Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals as improper use may negatively impact patient care. Chat GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft CoPilot performed relatively well with Chat GPT 4.0 demonstrating the highest performance.</p>","PeriodicalId":16866,"journal":{"name":"Journal of Prosthetic Dentistry","volume":" ","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large language models in periodontology: Assessing their performance in clinically relevant questions.\",\"authors\":\"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos\",\"doi\":\"10.1016/j.prosdent.2024.10.020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Statement of problem: </strong>Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. 
Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably.</p><p><strong>Purpose: </strong>The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.</p><p><strong>Material and methods: </strong>A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs.</p><p><strong>Results: </strong>The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >;.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, while significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance.</p><p><strong>Conclusions: </strong>Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals as improper use may negatively impact patient care. Chat GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft CoPilot performed relatively well with Chat GPT 4.0 demonstrating the highest performance.</p>\",\"PeriodicalId\":16866,\"journal\":{\"name\":\"Journal of Prosthetic Dentistry\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-11-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Prosthetic Dentistry\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.prosdent.2024.10.020\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Prosthetic Dentistry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.prosdent.2024.10.020","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Large language models in periodontology: Assessing their performance in clinically relevant questions.
Statement of problem: Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are of paramount concern. Research is required to examine whether large language models (LLMs) can reliably provide periodontal content.
Purpose: The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.
Material and methods: A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT (GPT-4.0), Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence, using a predefined rubric assessing comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). Two weeks after the initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while the Cronbach alpha and intraclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given to the different LLMs.
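The statistical workflow described above (inter-evaluator correlation, Cronbach alpha, and a Kruskal-Wallis comparison across models) can be illustrated with a short sketch. The Python code below uses hypothetical placeholder scores rather than the study's data, and it omits the intraclass correlation coefficient step for brevity; it is only meant to show how such reliability and comparison statistics might be computed, not to reproduce the authors' analysis.

```python
# A minimal sketch, assuming hypothetical ratings; not the authors' analysis script.
import numpy as np
from scipy import stats

# Hypothetical ratings: 10 questions x 4 LLMs, scored 0-10 by two evaluators.
rng = np.random.default_rng(0)
llms = ["ChatGPT 4.0", "Google Gemini", "Google Gemini Advanced", "Microsoft Copilot"]
evaluator_1 = {llm: rng.integers(5, 11, size=10) for llm in llms}
evaluator_2 = {llm: rng.integers(5, 11, size=10) for llm in llms}

# Inter-evaluator reliability: correlation between the two evaluators' scores.
all_e1 = np.concatenate([evaluator_1[llm] for llm in llms])
all_e2 = np.concatenate([evaluator_2[llm] for llm in llms])
rho, p_rho = stats.spearmanr(all_e1, all_e2)
print(f"Inter-evaluator Spearman rho = {rho:.2f} (P = {p_rho:.3f})")

# Cronbach alpha, treating the two evaluators as "items" rating the same answers.
items = np.vstack([all_e1, all_e2])            # shape: (k items, n observations)
k = items.shape[0]
item_vars = items.var(axis=1, ddof=1).sum()
total_var = items.sum(axis=0).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"Cronbach alpha = {alpha:.2f}")

# Kruskal-Wallis test comparing the averaged scores of the 4 LLMs.
avg_scores = {llm: (evaluator_1[llm] + evaluator_2[llm]) / 2 for llm in llms}
h, p_kw = stats.kruskal(*avg_scores.values())
print(f"Kruskal-Wallis H = {h:.2f} (P = {p_kw:.3f})")
```

With real rubric scores in place of the random placeholders, the same calls would yield the reliability coefficients and the omnibus comparison reported in the Results.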
Results: The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >.999); therefore, an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, and a significant difference was detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance.
Conclusions: Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals, as improper use may negatively impact patient care. ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot performed relatively well, with ChatGPT 4.0 demonstrating the highest performance.
Journal introduction:
The Journal of Prosthetic Dentistry is the leading professional journal devoted exclusively to prosthetic and restorative dentistry. The Journal is the official publication for 24 leading U.S. and international prosthodontic organizations. The monthly publication features timely, original peer-reviewed articles on the newest techniques, dental materials, and research findings. The Journal serves prosthodontists and dentists in advanced practice, and features color photos that illustrate many step-by-step procedures. The Journal of Prosthetic Dentistry is included in Index Medicus and CINAHL.