Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos
{"title":"大语言模型在牙周功能缺损治疗临床问题回答中的表现评价。","authors":"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos","doi":"10.3390/dj13060271","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background/Objectives</b>: Large Language Models (LLMs) are artificial intelligence (AI) systems with the capacity to process vast amounts of text and generate human-like language, offering the potential for improved information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects. <b>Methods</b>: Four LLMs-ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot-were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a \"gold standard\" derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric and a scoring system of 0-10. <b>Results</b>: The study found variability in LLM performance across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal-Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Evaluator agreement and intra-evaluator reliability were high. <b>Conclusions</b>: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies. LLMs showed different comprehensiveness, scientific accuracy, clarity, and relevance degrees. Dental professionals should be aware of LLMs' capabilities and limitations when seeking clinical information.</p>","PeriodicalId":11269,"journal":{"name":"Dentistry Journal","volume":"13 6","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12191798/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management.\",\"authors\":\"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos\",\"doi\":\"10.3390/dj13060271\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background/Objectives</b>: Large Language Models (LLMs) are artificial intelligence (AI) systems with the capacity to process vast amounts of text and generate human-like language, offering the potential for improved information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects. <b>Methods</b>: Four LLMs-ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot-were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a \\\"gold standard\\\" derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. 
Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric and a scoring system of 0-10. <b>Results</b>: The study found variability in LLM performance across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal-Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Evaluator agreement and intra-evaluator reliability were high. <b>Conclusions</b>: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies. LLMs showed different comprehensiveness, scientific accuracy, clarity, and relevance degrees. Dental professionals should be aware of LLMs' capabilities and limitations when seeking clinical information.</p>\",\"PeriodicalId\":11269,\"journal\":{\"name\":\"Dentistry Journal\",\"volume\":\"13 6\",\"pages\":\"\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2025-06-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12191798/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Dentistry Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/dj13060271\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Dentistry Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/dj13060271","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management.
Background/Objectives: Large Language Models (LLMs) are artificial intelligence (AI) systems capable of processing vast amounts of text and generating human-like language, offering the potential to improve information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects. Methods: Four LLMs (ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot) were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a "gold standard" derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric with a 0-10 scoring scale. Results: The study found variability in LLM performance across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal-Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Inter-evaluator agreement and intra-evaluator reliability were both high. Conclusions: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies: the models showed differing degrees of comprehensiveness, scientific accuracy, clarity, and relevance. Dental professionals should be aware of LLMs' capabilities and limitations when seeking clinical information.
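To make the analysis described above concrete, the sketch below shows how the two statistics named in the abstract could be computed in Python: a Kruskal-Wallis test comparing the four models' overall scores, and inter-evaluator agreement between the two periodontists. This is not the authors' actual analysis code, and all score values are invented placeholders; the real per-question rubric data are not reproduced in the abstract.

```python
# Minimal sketch of the abstract's statistical analysis (assumed, not the
# authors' code). All scores below are hypothetical placeholders on the
# paper's 0-10 rubric scale, one value per clinical question (10 questions).
from scipy.stats import kruskal
from sklearn.metrics import cohen_kappa_score

scores = {
    "ChatGPT 4.0":            [8, 9, 7, 8, 9, 8, 7, 9, 8, 8],
    "Google Gemini":          [7, 6, 7, 8, 6, 7, 7, 6, 7, 7],
    "Google Gemini Advanced": [9, 9, 8, 9, 10, 9, 8, 9, 9, 9],
    "Microsoft Copilot":      [6, 7, 6, 7, 6, 7, 6, 6, 7, 6],
}

# Kruskal-Wallis: non-parametric test for differences among independent
# groups; appropriate for ordinal rubric scores that may not be normal.
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")
# p > 0.05 would match the paper's finding of no significant difference.

# Agreement between the two evaluators (hypothetical paired ratings).
# Quadratically weighted kappa penalizes large rating disagreements more.
rater_1 = [8, 9, 7, 8, 9, 8, 7, 9, 8, 8]
rater_2 = [8, 9, 8, 8, 9, 8, 7, 9, 8, 7]
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted Cohen's kappa = {kappa:.2f}")
```

A non-parametric test and a weighted agreement coefficient are natural choices for bounded 0-10 ratings, since the scale is ordinal rather than interval; the paper may have used a different agreement statistic (e.g., ICC), which the abstract does not specify.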