Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR).

IF 5.7 | Zone 3 (Medicine) | JCR Q1, Health Care Sciences & Services

Journal of Medical Systems Pub Date : 2025-06-12 DOI:10.1007/s10916-025-02212-0

Diego A Forero, Sandra E Abreu, Blanca E Tovar, Marilyn H Oermann

Citation: Journal of Medical Systems 49(1):80. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12162794/pdf/


Abstract

In the context of Evidence-Based Practice (EBP), Systematic Reviews (SRs), Meta-Analyses (MAs), and overviews of reviews have become cornerstones for the synthesis of research findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 and Preferred Reporting Items for Overviews of Reviews (PRIOR) statements have become major reporting guidelines for SRs/MAs and for overviews of reviews, respectively. In recent years, advances in Generative Artificial Intelligence (genAI) have been proposed as a potential major paradigm shift in scientific research. The main aim of this research was to examine the performance of four Large Language Models (LLMs) in analyzing adherence to PRISMA 2020 and PRIOR in a sample of 20 SRs and 20 overviews of reviews. We tested the free versions of four commonly used LLMs: ChatGPT (GPT-4o), DeepSeek (V3), Gemini (2.0 Flash), and Qwen (2.5 Max). The adherence scores produced by the LLMs were compared with scores previously defined by human experts, using several statistical tests. All four LLMs performed poorly in the analysis of adherence to PRISMA 2020, overestimating the percentage of adherence by 23 to 30%. For PRIOR, the differences in estimated adherence were smaller (from 6 to 14%), and ChatGPT performed similarly to human experts. This is the first report of the performance of four commonly used LLMs for the analysis of adherence to PRISMA 2020 and PRIOR. Future studies of adherence to other reporting guidelines will be helpful in health sciences research.
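
To make the idea of "overestimating the percentage of adherence" concrete, the following is a minimal Python sketch, not the authors' analysis. It assumes that adherence is scored as the share of checklist items judged adequately reported and that LLM and human scores are paired per review. The function names and the item-level data are hypothetical, and the sketch does not reproduce the statistical tests mentioned in the abstract, which are not specified there.

```python
from statistics import mean


def adherence_percent(item_flags):
    """Percentage of checklist items judged as adequately reported."""
    if not item_flags:
        raise ValueError("at least one checklist item is required")
    return 100.0 * sum(bool(flag) for flag in item_flags) / len(item_flags)


def mean_overestimation(llm_scores, human_scores):
    """Mean paired difference (LLM minus human), in percentage points."""
    if len(llm_scores) != len(human_scores):
        raise ValueError("scores must be paired per review")
    return mean(l - h for l, h in zip(llm_scores, human_scores))


# Hypothetical item-level judgments for two reviews (True = item reported).
# Real checklists have many more items; four are used here for brevity.
human = [[True, False, True, False], [True, True, False, False]]
llm = [[True, True, True, False], [True, True, True, False]]

human_pct = [adherence_percent(review) for review in human]
llm_pct = [adherence_percent(review) for review in llm]

# A positive value means the LLM rated adherence higher than the experts.
print(mean_overestimation(llm_pct, human_pct))
```

With these illustrative inputs, both human scores are 50% and both LLM scores are 75%, so the script prints a mean overestimation of 25.0 percentage points.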


Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore: 11.60

Self-citation rate: 1.90%

Review time: 4.8 months

Journal description: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.