Comparing the performance of ChatGPT, DeepSeek, and Gemini in systematic and umbrella review tasks over time.

IF 3.5 · JCR Q1, DENTISTRY, ORAL SURGERY & MEDICINE · CAS Zone 2 (Medicine)
Maryam Emami, Mohammadjavad Shirani
DOI: 10.1016/j.adaj.2025.08.011 · Journal of the American Dental Association · Published 2025-10-15 (Journal Article)
Citations: 0

Abstract


Background: This study aimed to compare the performance of ChatGPT-4o (OpenAI), DeepSeek-V3 (High-Flyer), and Gemini 1.5 Pro (Google) during 3 consecutive weeks in performing full-text screening, data extraction, and risk of bias assessment tasks in systematic and umbrella reviews.

Methods: This study evaluated the correctness of large language model (LLM) responses in performing review study tasks by prompting 3 independent accounts. This process was repeated during 3 consecutive weeks for 40 primary studies. The correctness of responses was scored, and data were analyzed by Kendall W, generalized estimating equations followed by pairwise comparisons with Bonferroni correction, and Mann-Whitney U tests (α = .05).
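Kendall's W (the coefficient of concordance) measures agreement among the 3 independent accounts across the 40 primary studies. As an illustration of the statistic named above (not the authors' actual analysis code, and with hypothetical scores), a minimal NumPy sketch:

```python
import numpy as np

def rank_avg(x):
    """Rank the values of a 1-D array, averaging tied ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W for m raters scoring n items.

    ratings: (m, n) array; 0 = no agreement, 1 = perfect agreement.
    Ties are rank-averaged; this sketch omits the tie correction term.
    """
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.vstack([rank_avg(row) for row in ratings])  # rank items within each rater
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: 3 accounts scoring 40 primary studies on a 0-10 scale
rng = np.random.default_rng(0)
scores = rng.integers(0, 11, size=(3, 40))
w = kendalls_w(scores)
```

The unpaired comparisons (e.g., systematic vs umbrella review scores) reported in the abstract map onto `scipy.stats.mannwhitneyu`, and the repeated weekly measurements onto generalized estimating equations (available in `statsmodels`).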

Results: DeepSeek achieved the highest data extraction accuracy (> 90%), followed by ChatGPT (> 88%). Moreover, DeepSeek significantly outperformed Gemini in data extraction in most pairwise comparisons (P < .0167). Gemini's data extraction performance improved over time, with significantly higher accuracy in the third week than in the first (P < .0167). ChatGPT generally performed better in systematic reviews than in umbrella reviews (P < .05).
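The P < .0167 threshold above is consistent with the Bonferroni correction stated in the Methods: with 3 models there are 3 pairwise comparisons, so α = .05 is divided by 3. A minimal sketch of that arithmetic:

```python
from itertools import combinations

models = ["ChatGPT-4o", "DeepSeek-V3", "Gemini 1.5 Pro"]
pairs = list(combinations(models, 2))  # the 3 pairwise model comparisons
alpha = 0.05
adjusted_alpha = alpha / len(pairs)    # 0.05 / 3 ≈ 0.0167
```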

Conclusions: The studied LLMs showed potential for accurate data extraction, particularly DeepSeek, but consistently had unreliable performance in critical tasks like full-text screening and risk of bias assessment. LLM applications in review studies require cautious expert supervision.

Practical implications: Researchers planning to use LLMs for review study tasks should be aware that LLM responses to full-text screening and risk of bias assessment are unreliable. DeepSeek is the preferred LLM for data extraction in both systematic and umbrella reviews, whereas ChatGPT is recommended for systematic reviews.

Source journal
Journal of the American Dental Association (Medicine – Dentistry & Oral Surgery)
CiteScore: 5.30
Self-citation rate: 10.30%
Articles per year: 221
Review time: 34 days
About the journal: There is not a single source or solution to help dentists in their quest for lifelong learning, improving dental practice, and dental well-being. JADA+, along with The Journal of the American Dental Association, is striving to do just that, bringing together practical content covering dentistry topics and procedures to help dentists—both general dentists and specialists—provide better patient care and improve oral health and well-being. This is a work in progress; as we add more content, covering more topics of interest, it will continue to expand, becoming an ever-more essential source of oral health knowledge.