Large language models in periodontology: Assessing their performance in clinically relevant questions.

IF 4.3 · Zone 2 (Medicine) · Q1 DENTISTRY, ORAL SURGERY & MEDICINE
Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos
{"title":"牙周病学中的大型语言模型:评估其在临床相关问题中的表现。","authors":"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos","doi":"10.1016/j.prosdent.2024.10.020","DOIUrl":null,"url":null,"abstract":"<p><strong>Statement of problem: </strong>Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably.</p><p><strong>Purpose: </strong>The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.</p><p><strong>Material and methods: </strong>A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs.</p><p><strong>Results: </strong>The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >;.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, while significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance.</p><p><strong>Conclusions: </strong>Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals as improper use may negatively impact patient care. Chat GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft CoPilot performed relatively well with Chat GPT 4.0 demonstrating the highest performance.</p>","PeriodicalId":16866,"journal":{"name":"Journal of Prosthetic Dentistry","volume":" ","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large language models in periodontology: Assessing their performance in clinically relevant questions.\",\"authors\":\"Georgios S Chatzopoulos, Vasiliki P Koidou, Lazaros Tsalikis, Eleftherios G Kaklamanos\",\"doi\":\"10.1016/j.prosdent.2024.10.020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Statement of problem: </strong>Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are paramount. 
Research is required to examine whether large language models (LLMs) can be used in accessing periodontal content reliably.</p><p><strong>Purpose: </strong>The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.</p><p><strong>Material and methods: </strong>A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence based on a predefined rubric assessing the comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). After a period of 2 weeks from initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and interclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores given by different LLMs.</p><p><strong>Results: </strong>The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >;.999), therefore an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, while significant differences were detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, with scientific accuracy, clarity, and relevance.</p><p><strong>Conclusions: </strong>Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals as improper use may negatively impact patient care. Chat GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft CoPilot performed relatively well with Chat GPT 4.0 demonstrating the highest performance.</p>\",\"PeriodicalId\":16866,\"journal\":{\"name\":\"Journal of Prosthetic Dentistry\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-11-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Prosthetic Dentistry\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.prosdent.2024.10.020\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Prosthetic Dentistry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.prosdent.2024.10.020","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Statement of problem: Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the consequences of inaccurate or even harmful responses are a paramount concern. Research is required to examine whether large language models (LLMs) can be used to access periodontal content reliably.

Purpose: The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.

Material and methods: A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT model GPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence, using a predefined rubric assessing comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). Two weeks after the initial evaluation, the answers were re-graded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while Cronbach alpha and the intraclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores of the different LLMs.
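The following is a minimal sketch of the kind of statistical workflow described above, run on entirely hypothetical scores rather than the study's data. The variable names, the simulated ratings, and the use of Spearman correlation for the inter-evaluator comparison are assumptions for illustration; the block only shows how Cronbach alpha and a Kruskal-Wallis test could be computed for 4 models, 10 questions, and 2 evaluators in Python.

```python
# Illustrative only: hypothetical scores, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["ChatGPT 4.0", "Google Gemini", "Google Gemini Advanced", "Microsoft Copilot"]

# scores[model] holds a (10 questions x 2 evaluators) array of 0-10 ratings.
scores = {m: rng.integers(5, 11, size=(10, 2)).astype(float) for m in models}

# Inter-evaluator reliability: correlation between the two evaluators' ratings
# (the paper reports "correlation tests"; Spearman is an assumed choice here).
eval1 = np.concatenate([scores[m][:, 0] for m in models])
eval2 = np.concatenate([scores[m][:, 1] for m in models])
rho, p_rho = stats.spearmanr(eval1, eval2)
print(f"Spearman rho between evaluators: {rho:.2f} (P={p_rho:.3f})")

# Overall reliability: Cronbach alpha, treating the two evaluators as items.
ratings = np.column_stack([eval1, eval2])          # (40 answers, 2 raters)
k = ratings.shape[1]
alpha = k / (k - 1) * (1 - ratings.var(axis=0, ddof=1).sum()
                       / ratings.sum(axis=1).var(ddof=1))
print(f"Cronbach alpha: {alpha:.2f}")

# Kruskal-Wallis test comparing the per-question average scores of the 4 LLMs.
avg = {m: scores[m].mean(axis=1) for m in models}
H, p_kw = stats.kruskal(*avg.values())
print(f"Kruskal-Wallis: H = {H:.2f}, P = {p_kw:.3f}")
```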

Results: The scores provided by the 2 evaluators for both evaluations were statistically similar (P values ranging from .083 to >.999); therefore, an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0, while Google Gemini had the lowest scores. ChatGPT 4.0 received the highest average score, and a significant difference was detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, scientifically accurate, clear, and relevant.
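Continuing the hypothetical `models`/`avg` objects from the sketch above, a pairwise follow-up of the kind behind the reported ChatGPT 4.0 versus Google Gemini difference could look like the snippet below. The abstract does not state which post hoc procedure was used; a Mann-Whitney U test with Bonferroni correction is an assumed stand-in, not the authors' method.

```python
# Continues the hypothetical `models` and `avg` objects from the previous sketch.
from itertools import combinations

pairs = list(combinations(models, 2))
for a, b in pairs:
    U, p = stats.mannwhitneyu(avg[a], avg[b], alternative="two-sided")
    p_adj = min(p * len(pairs), 1.0)     # Bonferroni adjustment over 6 pairs
    print(f"{a} vs {b}: U = {U:.1f}, adjusted P = {p_adj:.3f}")
```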

Conclusions: Professionals need to be aware of the limitations of LLMs when utilizing them. These models must not replace dental professionals, as improper use may negatively impact patient care. ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot all performed relatively well, with ChatGPT 4.0 demonstrating the highest performance.

Source journal
Journal of Prosthetic Dentistry (Medicine – Dentistry, Oral Surgery & Medicine)
CiteScore: 7.00
Self-citation rate: 13.00%
Articles per year: 599
Review time: 69 days
About the journal: The Journal of Prosthetic Dentistry is the leading professional journal devoted exclusively to prosthetic and restorative dentistry. The Journal is the official publication for 24 leading U.S. and international prosthodontic organizations. The monthly publication features timely, original peer-reviewed articles on the newest techniques, dental materials, and research findings. The Journal serves prosthodontists and dentists in advanced practice, and features color photos that illustrate many step-by-step procedures. The Journal of Prosthetic Dentistry is included in Index Medicus and CINAHL.