Comparative study of technical and patient-related question answering quality of DeepSeek-R1 and ChatGPT-4o in the field of oral and maxillofacial surgery

IF 1.8
Yunus Balel
DOI: 10.1007/s10006-025-01464-x
Journal: Oral and maxillofacial surgery, 29(1): 163
Published: 2025-09-29 (Journal Article)
Citations: 0

Abstract

Background: Artificial Intelligence (AI) technologies demonstrate potential as supplementary tools in healthcare, particularly in surgery, where they assist with preoperative planning, intraoperative decisions, and postoperative monitoring. In oral and maxillofacial surgery, integrating AI presents unique opportunities and challenges due to the field's complex anatomical and functional demands.

Objective: This study compares the performance of two AI language models, DeepSeek-R1 and ChatGPT-4o, in addressing technical and patient-related inquiries in oral and maxillofacial surgery.

Methods: A dataset of 120 questions, including 60 technical and 60 patient-related queries, was developed based on prior studies. These questions covered impacted teeth, dental implants, temporomandibular joint disorders, and orthognathic surgery. Responses from DeepSeek-R1 and ChatGPT-4o were randomized and evaluated using the Modified Global Quality Scale (GQS). Statistical analysis was conducted using non-parametric tests, such as the Wilcoxon Signed-Rank Test and Kruskal-Wallis H Test, with a significance threshold of p = 0.05.
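
The paired, non-parametric comparison described in the Methods can be sketched with a stdlib-only implementation of the Wilcoxon signed-rank test (two-sided, normal approximation); the paired GQS ratings below are hypothetical placeholders, not the study's actual data.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (normal approximation,
    two-sided). Returns (W, p_value)."""
    # Differences per paired question; zero differences are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank absolute differences, averaging ranks for tied magnitudes.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = (i + j) / 2 + 1  # average rank of the tie run
        i = j + 1
    w_pos = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_neg = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_pos, w_neg)
    # Normal approximation to the null distribution of W.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w, min(p, 1.0)

# Hypothetical paired GQS ratings (1-5) for the same ten questions --
# placeholder values only, not the study's data.
deepseek = [5, 4, 5, 3, 5, 4, 5, 5, 4, 3]
chatgpt  = [4, 4, 4, 3, 4, 5, 4, 4, 5, 3]
w, p = wilcoxon_signed_rank(deepseek, chatgpt)
print(f"W = {w}, p = {p:.3f}")  # compared against the 0.05 threshold
```

The Wilcoxon signed-rank test suits this design because the two models answered the same questions, so per-question ratings form matched pairs of ordinal data.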

Results: The mean GQS score for DeepSeek-R1 was 4.53 ± 0.95, compared with 4.39 ± 1.14 for ChatGPT-4o. DeepSeek-R1 achieved a mean GQS of 4.87 in patient-related inquiries, such as orthognathic surgery and dental implants, compared with 4.73 for ChatGPT-4o. In contrast, ChatGPT-4o received higher average scores in technical questions related to temporomandibular joint disorders. Across all 120 questions, the two models showed no statistically significant difference in performance (p = 0.270). In comparison with models evaluated in previous studies, ScholarGPT showed the highest performance. Its advantage was not statistically significant relative to DeepSeek-R1 (p = 0.121) but was significant relative to ChatGPT-4o and ChatGPT-3.5 (p = 0.027 and p < 0.001, respectively).
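
The category-level comparisons reported above use the Kruskal-Wallis H test; a minimal stdlib sketch of the tie-corrected H statistic follows. The grouped scores are hypothetical placeholders, and 7.815 is the chi-square critical value for 3 degrees of freedom at α = 0.05.

```python
def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for k independent groups."""
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:  # assign average ranks to runs of tied values
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    rank_sums = [0.0] * len(groups)
    for (_, gi), r in zip(pooled, ranks):
        rank_sums[gi] += r
    h = 12 / (n * (n + 1)) * sum(
        rs**2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n**3 - n))  # tie correction

# Hypothetical GQS ratings grouped by question category (placeholders).
scores = {
    "impacted teeth":       [5, 4, 5, 5, 4],
    "dental implants":      [5, 5, 4, 5, 5],
    "TMJ disorders":        [4, 3, 4, 4, 3],
    "orthognathic surgery": [5, 5, 5, 4, 5],
}
h = kruskal_wallis_h(list(scores.values()))
# Compare to the chi-square critical value for df = k - 1 = 3, alpha = 0.05.
print(f"H = {h:.2f}, significant: {h > 7.815}")
```

Because GQS ratings are ordinal and the category groups are independent, this rank-based test is preferred over a one-way ANOVA here.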

Conclusions: DeepSeek-R1 and ChatGPT-4o provide comparable performance in addressing patient and technical inquiries in oral and maxillofacial surgery, with small variations depending on the question category. Although statistical differences were not significant, incremental improvements in AI models' response quality were observed. Future research should focus on enhancing their reliability and applicability in clinical settings.
