The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert
{"title":"The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review","authors":"Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert","doi":"arxiv-2409.04600","DOIUrl":null,"url":null,"abstract":"Objective: This study aims to summarize the usage of Large Language Models\n(LLMs) in the process of creating a scientific review. We look at the range of\nstages in a review that can be automated and assess the current\nstate-of-the-art research projects in the field. Materials and Methods: The\nsearch was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google\nScholar databases by human reviewers. Screening and extraction process took\nplace in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model.\nChatGPT was used to clean extracted data and generate code for figures in this\nmanuscript, ChatGPT and Scite.ai were used in drafting all components of the\nmanuscript, except the methods and discussion sections. Results: 3,788 articles\nwere retrieved, and 172 studies were deemed eligible for the final review.\nChatGPT and GPT-based LLM emerged as the most dominant architecture for review\nautomation (n=126, 73.2%). A significant number of review automation projects\nwere found, but only a limited number of papers (n=26, 15.1%) were actual\nreviews that used LLM during their creation. Most citations focused on\nautomation of a particular stage of review, such as Searching for publications\n(n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled\nperformance of GPT-based and BERT-based models, the former were better in data\nextraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8),\nwhile being slightly less accurate in title and abstract screening stage\n(Maccuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic\nreview revealed a significant number of research projects related to review\nautomation using LLMs. 
The results looked promising, and we anticipate that\nLLMs will change in the near future the way the scientific reviews are\nconducted.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04600","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of review stages that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses the OpenAI gpt-4o model. ChatGPT was used to clean the extracted data and to generate code for the figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the dominant architecture for review automation (n=126, 73.2%). A significant number of review-automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most citations focused on automating a particular stage of the review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision 83.0% (SD=10.4) and recall 86.0% (SD=9.8), while being slightly less accurate in the title and abstract screening stage (mean accuracy 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results look promising, and we anticipate that LLMs will change the way scientific reviews are conducted in the near future.
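The pooled figures above (e.g. mean precision 83.0%, SD=10.4) aggregate per-study metric values across the included papers. A minimal sketch of that pooling step, using hypothetical per-study precision scores rather than the review's actual extracted data:

```python
from statistics import mean, stdev

def pool_metrics(values):
    """Pool per-study metric values (in percent) into a (mean, sample SD) pair."""
    return round(mean(values), 1), round(stdev(values), 1)

# Hypothetical precision scores for five studies, for illustration only;
# the review pooled values extracted from its 172 included studies.
precision = [72.0, 85.5, 90.0, 78.5, 88.0]
m, sd = pool_metrics(precision)
print(f"mean precision {m}% (SD={sd})")  # → mean precision 82.8% (SD=7.4)
```

`stdev` computes the sample standard deviation (n-1 denominator); the abstract does not state which SD convention was used, so this is an assumption.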