Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time.

IF 4.3 · CAS Zone 2 (Medicine) · Q1 DENTISTRY, ORAL SURGERY & MEDICINE
Mohammadjavad Shirani
{"title":"比较ChatGPT 40、DeepSeek R1和Gemini 2 Pro在回答固定修复问题时的表现。","authors":"Mohammadjavad Shirani","doi":"10.1016/j.prosdent.2025.04.038","DOIUrl":null,"url":null,"abstract":"<p><strong>Statement of problem: </strong>The accuracy of DeepSeek and the latest versions of ChatGPT and Gemini in responding to prosthodontics questions needs to be evaluated. Additionally, the extent to which the performance of these chatbots changes through user interactions remains unexplored.</p><p><strong>Purpose: </strong>The purpose of this longitudinal repeated-measures experimental study was to compare the performance of ChatGPT (4o), DeepSeek (R1), and Gemini (2 Pro) in answering multiple-choice (MC) and short-answer (SA) fixed prosthodontics questions over 4 consecutive weeks after exposure to correct responses.</p><p><strong>Material and methods: </strong>A total of 40 questions (20 MC and 20 SA) were developed based on the sixth edition of Contemporary Fixed Prosthodontics. Following a standardized protocol, these questions were posed to ChatGPT, DeepSeek, and Gemini on 4 consecutive Saturdays using 10 independent accounts per chatbot. After each session, correct answers were provided to the chatbots, and, before the next session, their memory and history were cleared. Responses were scored as correct (1) or incorrect (0) for MC questions and correct (2), partially correct (1), or incorrect (0) for SA questions. Weighted accuracy was calculated accordingly. The Kendall W coefficient was used to assess agreement among the 10 accounts per chatbot. The effects of chatbot type, time (week), and their interaction on performance were analyzed using generalized estimating equations (GEEs), followed by pairwise comparisons using the Mann-Whitney U test and Wilcoxon signed-rank test with Bonferroni adjustments for multiple comparisons (α=.05).</p><p><strong>Results: </strong>All chatbots showed significant reproducibility, with Gemini exhibiting the highest repeatability for SA questions, followed by ChatGPT for MC questions. Accuracy ranged between 43% and 71%. ChatGPT and DeepSeek demonstrated significantly better performance in MC questions compared with Gemini (P<.017). However, in the third week, Gemini outperformed DeepSeek in SA questions (P=.007). Over time, Gemini showed continuous improvement in SA questions, whereas DeepSeek exhibited a performance surge in the fourth week. ChatGPT's performance remained stable throughout the study period.</p><p><strong>Conclusions: </strong>The overall accuracy of the studied chatbots in answering MC and SA prosthodontics questions was not satisfactory. Among them, ChatGPT was the most reliable for MC questions, while ChatGPT and Gemini performed best for SA questions. 
Gemini (for SA questions) and DeepSeek (for MC and SA questions) demonstrated improvement after exposure to correct responses.</p>","PeriodicalId":16866,"journal":{"name":"Journal of Prosthetic Dentistry","volume":" ","pages":""},"PeriodicalIF":4.3000,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time.\",\"authors\":\"Mohammadjavad Shirani\",\"doi\":\"10.1016/j.prosdent.2025.04.038\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Statement of problem: </strong>The accuracy of DeepSeek and the latest versions of ChatGPT and Gemini in responding to prosthodontics questions needs to be evaluated. Additionally, the extent to which the performance of these chatbots changes through user interactions remains unexplored.</p><p><strong>Purpose: </strong>The purpose of this longitudinal repeated-measures experimental study was to compare the performance of ChatGPT (4o), DeepSeek (R1), and Gemini (2 Pro) in answering multiple-choice (MC) and short-answer (SA) fixed prosthodontics questions over 4 consecutive weeks after exposure to correct responses.</p><p><strong>Material and methods: </strong>A total of 40 questions (20 MC and 20 SA) were developed based on the sixth edition of Contemporary Fixed Prosthodontics. Following a standardized protocol, these questions were posed to ChatGPT, DeepSeek, and Gemini on 4 consecutive Saturdays using 10 independent accounts per chatbot. After each session, correct answers were provided to the chatbots, and, before the next session, their memory and history were cleared. Responses were scored as correct (1) or incorrect (0) for MC questions and correct (2), partially correct (1), or incorrect (0) for SA questions. Weighted accuracy was calculated accordingly. The Kendall W coefficient was used to assess agreement among the 10 accounts per chatbot. The effects of chatbot type, time (week), and their interaction on performance were analyzed using generalized estimating equations (GEEs), followed by pairwise comparisons using the Mann-Whitney U test and Wilcoxon signed-rank test with Bonferroni adjustments for multiple comparisons (α=.05).</p><p><strong>Results: </strong>All chatbots showed significant reproducibility, with Gemini exhibiting the highest repeatability for SA questions, followed by ChatGPT for MC questions. Accuracy ranged between 43% and 71%. ChatGPT and DeepSeek demonstrated significantly better performance in MC questions compared with Gemini (P<.017). However, in the third week, Gemini outperformed DeepSeek in SA questions (P=.007). Over time, Gemini showed continuous improvement in SA questions, whereas DeepSeek exhibited a performance surge in the fourth week. ChatGPT's performance remained stable throughout the study period.</p><p><strong>Conclusions: </strong>The overall accuracy of the studied chatbots in answering MC and SA prosthodontics questions was not satisfactory. Among them, ChatGPT was the most reliable for MC questions, while ChatGPT and Gemini performed best for SA questions. 
Gemini (for SA questions) and DeepSeek (for MC and SA questions) demonstrated improvement after exposure to correct responses.</p>\",\"PeriodicalId\":16866,\"journal\":{\"name\":\"Journal of Prosthetic Dentistry\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-05-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Prosthetic Dentistry\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1016/j.prosdent.2025.04.038\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"DENTISTRY, ORAL SURGERY & MEDICINE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Prosthetic Dentistry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.prosdent.2025.04.038","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"DENTISTRY, ORAL SURGERY & MEDICINE","Score":null,"Total":0}
Citations: 0

Abstract


Statement of problem: The accuracy of DeepSeek and the latest versions of ChatGPT and Gemini in responding to prosthodontics questions needs to be evaluated. Additionally, the extent to which the performance of these chatbots changes through user interactions remains unexplored.

Purpose: The purpose of this longitudinal repeated-measures experimental study was to compare the performance of ChatGPT (4o), DeepSeek (R1), and Gemini (2 Pro) in answering multiple-choice (MC) and short-answer (SA) fixed prosthodontics questions over 4 consecutive weeks after exposure to correct responses.

Material and methods: A total of 40 questions (20 MC and 20 SA) were developed based on the sixth edition of Contemporary Fixed Prosthodontics. Following a standardized protocol, these questions were posed to ChatGPT, DeepSeek, and Gemini on 4 consecutive Saturdays using 10 independent accounts per chatbot. After each session, correct answers were provided to the chatbots, and, before the next session, their memory and history were cleared. Responses were scored as correct (1) or incorrect (0) for MC questions and correct (2), partially correct (1), or incorrect (0) for SA questions. Weighted accuracy was calculated accordingly. The Kendall W coefficient was used to assess agreement among the 10 accounts per chatbot. The effects of chatbot type, time (week), and their interaction on performance were analyzed using generalized estimating equations (GEEs), followed by pairwise comparisons using the Mann-Whitney U test and Wilcoxon signed-rank test with Bonferroni adjustments for multiple comparisons (α=.05).
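The scoring and agreement measures described above can be made concrete with a short sketch. The Python snippet below is not from the published study: the weighting scheme (points earned divided by points available), the 10-account by 40-question data layout, the helper names weighted_accuracy and kendalls_w, and the randomly generated scores are all illustrative assumptions, and Kendall's W is computed here without a tie correction for brevity.

# Illustrative sketch only; scoring weights, data layout, and data are assumed, not taken from the study.
import numpy as np
from scipy.stats import rankdata

MC_MAX, SA_MAX = 1, 2  # per-question maximums: MC scored 0/1, SA scored 0/1/2

def weighted_accuracy(mc_scores, sa_scores):
    """Weighted accuracy = points earned / maximum points available (assumed weighting)."""
    earned = sum(mc_scores) + sum(sa_scores)
    possible = MC_MAX * len(mc_scores) + SA_MAX * len(sa_scores)
    return earned / possible

def kendalls_w(scores):
    """Kendall's W for agreement among raters.

    scores is an (m raters x n items) array; here, 10 accounts x 40 questions.
    Tie correction is omitted, so W is conservative when many scores are equal.
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape
    ranks = np.vstack([rankdata(row) for row in scores])  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: 10 accounts x 20 MC questions (0/1) and 20 SA questions (0/1/2)
rng = np.random.default_rng(0)
mc = rng.integers(0, 2, size=(10, 20))
sa = rng.integers(0, 3, size=(10, 20))
per_account = [weighted_accuracy(mc[i], sa[i]) for i in range(10)]
print("Weighted accuracy per account:", np.round(per_account, 2))
print(f"Kendall's W across accounts: {kendalls_w(np.hstack([mc, sa])):.3f}")

The pairwise tests named in the protocol are available in scipy.stats as mannwhitneyu and wilcoxon, should one want to extend the sketch to between- and within-chatbot comparisons.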

Results: All chatbots showed significant reproducibility, with Gemini exhibiting the highest repeatability for SA questions, followed by ChatGPT for MC questions. Accuracy ranged between 43% and 71%. ChatGPT and DeepSeek demonstrated significantly better performance in MC questions compared with Gemini (P<.017). However, in the third week, Gemini outperformed DeepSeek in SA questions (P=.007). Over time, Gemini showed continuous improvement in SA questions, whereas DeepSeek exhibited a performance surge in the fourth week. ChatGPT's performance remained stable throughout the study period.

Conclusions: The overall accuracy of the studied chatbots in answering MC and SA prosthodontics questions was not satisfactory. Among them, ChatGPT was the most reliable for MC questions, while ChatGPT and Gemini performed best for SA questions. Gemini (for SA questions) and DeepSeek (for MC and SA questions) demonstrated improvement after exposure to correct responses.

Source journal
Journal of Prosthetic Dentistry (Medicine - Dentistry & Oral Surgery)
CiteScore: 7.00
Self-citation rate: 13.00%
Annual articles: 599
Review time: 69 days
About the journal: The Journal of Prosthetic Dentistry is the leading professional journal devoted exclusively to prosthetic and restorative dentistry. The Journal is the official publication for 24 leading U.S. international prosthodontic organizations. The monthly publication features timely, original peer-reviewed articles on the newest techniques, dental materials, and research findings. The Journal serves prosthodontists and dentists in advanced practice, and features color photos that illustrate many step-by-step procedures. The Journal of Prosthetic Dentistry is included in Index Medicus and CINAHL.