Comparing the performance of ChatGPT, DeepSeek, and Gemini in systematic and umbrella review tasks over time.
Authors: Maryam Emami, Mohammadjavad Shirani
Journal: Journal of the American Dental Association (JCR Q1, Dentistry, Oral Surgery & Medicine; Impact Factor 3.5)
DOI: 10.1016/j.adaj.2025.08.011
Published: October 15, 2025 (Journal Article)
Citations: 0
Abstract
Background: This study aimed to compare the performance of ChatGPT-4o (OpenAI), DeepSeek-V3 (High-Flyer), and Gemini 1.5 Pro (Google) during 3 consecutive weeks in performing full-text screening, data extraction, and risk of bias assessment tasks in systematic and umbrella reviews.
Methods: This study evaluated the correctness of large language model (LLM) responses in performing review study tasks by prompting 3 independent accounts. This process was repeated during 3 consecutive weeks for 40 primary studies. The correctness of responses was scored, and data were analyzed by Kendall W, generalized estimating equations followed by pairwise comparisons with Bonferroni correction, and Mann-Whitney U tests (α = .05).
Results: DeepSeek achieved the highest data extraction accuracy (> 90%), followed by ChatGPT (> 88%). Moreover, DeepSeek significantly outperformed Gemini in data extraction in most pairwise comparisons (P < .0167). Gemini's data extraction performance improved over time, with significantly higher accuracy in the third week than in the first week (P < .0167). ChatGPT generally performed better in systematic reviews than in umbrella reviews (P < .05).
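The P < .0167 threshold reported for the pairwise comparisons is the Bonferroni-corrected significance level: with 3 models there are 3 pairwise comparisons, so the overall α = .05 is divided by 3. A minimal illustrative sketch of that arithmetic (not the authors' code; model names are taken from the abstract):

```python
# Bonferroni correction for the pairwise model comparisons described
# in the Methods/Results: alpha is split across the m comparisons.
from itertools import combinations

models = ["ChatGPT-4o", "DeepSeek-V3", "Gemini 1.5 Pro"]
alpha = 0.05

pairs = list(combinations(models, 2))  # all unordered model pairs
threshold = alpha / len(pairs)         # per-comparison significance level

print(len(pairs))           # 3 pairwise comparisons
print(round(threshold, 4))  # 0.0167
```

A pairwise Mann-Whitney U result is then declared significant only if its P value falls below this corrected threshold, which keeps the family-wise error rate at .05.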
Conclusions: The studied LLMs showed potential for accurate data extraction, particularly DeepSeek, but consistently had unreliable performance in critical tasks like full-text screening and risk of bias assessment. LLM applications in review studies require cautious expert supervision.
Practical implications: Researchers planning to use LLMs for review study tasks should be aware that LLM responses to full-text screening and risk of bias assessment are unreliable. DeepSeek is the preferred LLM for data extraction in both systematic and umbrella reviews, whereas ChatGPT is recommended for systematic reviews.
About the journal:
There is not a single source or solution to help dentists in their quest for lifelong learning, improving dental practice, and dental well-being. JADA+, along with The Journal of the American Dental Association, is striving to do just that, bringing together practical content covering dentistry topics and procedures to help dentists—both general dentists and specialists—provide better patient care and improve oral health and well-being. This is a work in progress; as we add more content, covering more topics of interest, it will continue to expand, becoming an ever-more essential source of oral health knowledge.