The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert
arXiv - CS - Digital Libraries, 2024-09-06. https://doi.org/arxiv-2409.04600
Abstract
Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We examine the range of review stages that can be automated and assess current state-of-the-art research projects in the field.

Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses the OpenAI GPT-4o model. ChatGPT was used to clean the extracted data and to generate code for the figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections.

Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most studies focused on automating a particular stage of the review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision of 83.0% (SD=10.4) and mean recall of 86.0% (SD=9.8), while being slightly less accurate at the title and abstract screening stage (mean accuracy=77.3%, SD=13.0).

Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change the way scientific reviews are conducted in the near future.
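As a reading aid for the pooled metrics in the Results: the abstract does not state the pooling method, but an unweighted mean and standard deviation across per-study scores is one common interpretation. The sketch below illustrates how such pooled precision/recall figures could be computed; the scores and function name are hypothetical, not taken from the paper.

```python
# Minimal sketch of pooling per-study performance metrics.
# Assumption: "pooled" = simple unweighted mean/SD across studies;
# the example scores below are made up for illustration only.
from statistics import mean, stdev

# Hypothetical per-study scores (in percent) for a data extraction task.
precision_scores = [72.0, 85.5, 91.0, 79.5, 88.0]
recall_scores = [78.0, 90.0, 84.5, 93.0, 86.5]

def pool(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-study scores."""
    return mean(scores), stdev(scores)

for name, scores in [("precision", precision_scores), ("recall", recall_scores)]:
    m, sd = pool(scores)
    print(f"Pooled {name}: mean={m:.1f}% (SD={sd:.1f})")
```

A study-size-weighted mean would be a reasonable alternative when the included studies evaluate very different numbers of records.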