Author: Yunus Balel
Journal: Oral and maxillofacial surgery, vol. 29, no. 1, p. 163 (Impact Factor 1.8)
DOI: 10.1007/s10006-025-01464-x
Published: 2025-09-29 (Journal Article)
Comparative study of technical and patient-related question answering quality of DeepSeek-R1 and ChatGPT-4o in the field of oral and maxillofacial surgery.
Background: Artificial Intelligence (AI) technologies demonstrate potential as supplementary tools in healthcare, particularly in surgery, where they assist with preoperative planning, intraoperative decisions, and postoperative monitoring. In oral and maxillofacial surgery, integrating AI poses unique opportunities and challenges due to the field's complex anatomical and functional demands.
Objective: This study compares the performance of two AI language models, DeepSeek-R1 and ChatGPT-4o, in addressing technical and patient-related inquiries in oral and maxillofacial surgery.
Methods: A dataset of 120 questions, including 60 technical and 60 patient-related queries, was developed based on prior studies. These questions covered impacted teeth, dental implants, temporomandibular joint disorders, and orthognathic surgery. Responses from DeepSeek-R1 and ChatGPT-4o were randomized and evaluated using the Modified Global Quality Scale (GQS). Statistical analysis was conducted using non-parametric tests, such as the Wilcoxon Signed-Rank Test and Kruskal-Wallis H Test, with a significance threshold of p = 0.05.
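The paired comparison described above can be sketched as follows. This is an illustrative, stdlib-only implementation of the Wilcoxon signed-rank statistic over per-question quality ratings; the scores below are hypothetical placeholders, not the study's data, and the real analysis would additionally compute a p-value (e.g. via scipy.stats.wilcoxon) and apply the Kruskal-Wallis H test across question categories.

```python
# Sketch of the paired Wilcoxon signed-rank comparison from Methods.
# The GQS ratings below are illustrative, not the study's data.

def wilcoxon_signed_rank(a, b):
    """Return the Wilcoxon W statistic (smaller of the signed rank sums).

    Zero differences are dropped and tied absolute differences receive
    average ranks, as in the standard procedure.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical 1-5 GQS ratings for the same eight questions from two models
deepseek_scores = [5, 4, 5, 5, 3, 4, 5, 5]
chatgpt_scores = [4, 4, 5, 3, 3, 5, 5, 4]
print(wilcoxon_signed_rank(deepseek_scores, chatgpt_scores))  # 2.0
```

The small W here reflects that one model outscored the other on most of the non-tied questions; a significance test would then compare W against its null distribution for the effective sample size.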
Results: The mean GQS score for DeepSeek-R1 was 4.53 ± 0.95, compared with 4.39 ± 1.14 for ChatGPT-4o. In patient-related inquiries, such as those on orthognathic surgery and dental implants, DeepSeek-R1 achieved a mean GQS of 4.87 versus 4.73 for ChatGPT-4o. In contrast, ChatGPT-4o received higher average scores on technical questions about temporomandibular joint disorders. Across all 120 questions, the difference in performance between the two models was not statistically significant (p = 0.270). In comparison with earlier models, ScholarGPT showed the highest performance: its advantage was not statistically significant over DeepSeek-R1 (p = 0.121) but was significant over ChatGPT-4o and ChatGPT-3.5 (p = 0.027 and p < 0.001, respectively).
Conclusions: DeepSeek-R1 and ChatGPT-4o provide comparable performance in addressing patient and technical inquiries in oral and maxillofacial surgery, with small variations depending on the question category. Although statistical differences were not significant, incremental improvements in AI models' response quality were observed. Future research should focus on enhancing their reliability and applicability in clinical settings.