Large language models for conducting systematic reviews: on the rise, but not yet ready for use—a scoping review

Impact Factor 7.3 · CAS Tier 2 (Medicine) · JCR Q1 · HEALTH CARE SCIENCES & SERVICES
Judith-Lisa Lieberum, Markus Töws, Maria-Inti Metzendorf, Felix Heilmeyer, Waldemar Siemens, Christian Haverkamp, Daniel Böhringer, Joerg J. Meerpohl, Angelika Eisele-Metzger
DOI: 10.1016/j.jclinepi.2025.111746
Journal of Clinical Epidemiology, Volume 181, Article 111746
Publication date: February 26, 2025
Available at: https://www.sciencedirect.com/science/article/pii/S0895435625000794
Citations: 0

Abstract

Background and Objectives

Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application to SR conduct have attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research.

Methods

We systematically searched MEDLINE, Web of Science, IEEE Xplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German published from April 2021 onwards, building on the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, which a second reviewer checked.

Results

Our database search yielded 8054 hits, and we identified a further 33 articles through hand searching. We ultimately included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies predominated (n = 21, 57%). In half of the studies, the authors evaluated LLM use as promising (n = 20, 54%); one-quarter rated it neutral (n = 9, 24%), and one-fifth rated it nonpromising (n = 8, 22%).
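All reported proportions share the denominator of 37 included articles. The following minimal sketch (not part of the original article) checks that each reported percentage matches its count when rounded to the nearest whole percent:

```python
# Sanity-check the percentages reported in the Results section.
# Counts and reported percentages are taken from the abstract;
# the denominator is the 37 included articles.
TOTAL = 37

# item: (count, reported percentage)
figures = {
    "literature search": (15, 41),
    "study selection": (14, 38),
    "data extraction": (11, 30),
    "GPT usage": (33, 89),
    "validation studies": (21, 57),
    "promising": (20, 54),
    "neutral": (9, 24),
    "nonpromising": (8, 22),
}

for item, (count, reported) in figures.items():
    computed = round(100 * count / TOTAL)
    assert computed == reported, f"{item}: computed {computed}%, reported {reported}%"

print("All reported percentages are consistent with n/37.")
```

Each figure checks out, which confirms that the counts and percentages in the abstract are internally consistent.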

Conclusion

Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.

Plain Language Summary

Systematic reviews are a crucial tool in health research: experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and ultimately identified 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in the context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the growing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.


Source Journal
Journal of Clinical Epidemiology
Category: Medicine; Public, Environmental & Occupational Health
CiteScore: 12.00
Self-citation rate: 6.90%
Articles per year: 320
Review time: 44 days
Journal description: The Journal of Clinical Epidemiology strives to enhance the quality of clinical and patient-oriented healthcare research by advancing and applying innovative methods in conducting, presenting, synthesizing, disseminating, and translating research results into optimal clinical practice. Special emphasis is placed on training new generations of scientists and clinical practice leaders.