Exploring the use of LLMs for the selection phase in systematic literature studies

Impact Factor 4.3 | CAS Tier 2 (Computer Science) | JCR Q2 (Computer Science, Information Systems)
Lukas Thode, Umar Iftikhar, Daniel Mendez
{"title":"Exploring the use of LLMs for the selection phase in systematic literature studies","authors":"Lukas Thode ,&nbsp;Umar Iftikhar ,&nbsp;Daniel Mendez","doi":"10.1016/j.infsof.2025.107757","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><div>Systematic literature studies, such as secondary studies, are crucial to aggregate evidence. An essential part of these studies is the selection phase of relevant studies. This, however, is time-consuming, resource-intensive, and error-prone as it highly depends on manual labor and domain expertise. The increasing popularity of Large Language Models (LLMs) raises the question to what extent these manual study selection tasks could be supported in an automated manner.</div></div><div><h3>Objectives:</h3><div>In this manuscript, we report on our effort to explore and evaluate the use of state-of-the-art LLMs to automate the selection phase in systematic literature studies.</div></div><div><h3>Method:</h3><div>We evaluated LLMs for the selection phase using two published systematic literature studies in software engineering as ground truth. Three prompts were designed and applied across five LLMs to the studies’ titles and abstracts based on their inclusion and exclusion criteria. Additionally, we analyzed combining two LLMs to replicate a practical selection phase. We analyzed recall and precision and reflected upon the accuracy of the LLMs, and whether the ground truth studies were conducted by early career scholars or by more advanced ones.</div></div><div><h3>Results:</h3><div>Our results show a high average recall of up to 98% combined with a precision of 27% in a single LLM approach and an average recall of 99% with a precision of 27% in a two-model approach replicating a two-reviewer procedure. Further the Llama 2 models showed the highest average recall 98% across all prompt templates and datasets while GPT4-turbo had the highest average precision 72%.</div></div><div><h3>Conclusions:</h3><div>Our results demonstrate how LLMs could support a selection phase in the future. We recommend a two LLM-approach to archive a higher recall. However, we also critically reflect upon how further studies are required using other models and prompts on more datasets to strengthen the confidence in our presented approach.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"184 ","pages":"Article 107757"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950584925000965","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Context:

Systematic literature studies, such as secondary studies, are crucial for aggregating evidence. An essential part of these studies is the selection phase, in which relevant studies are identified. This phase, however, is time-consuming, resource-intensive, and error-prone, as it depends heavily on manual labor and domain expertise. The increasing popularity of Large Language Models (LLMs) raises the question of to what extent these manual study selection tasks could be supported in an automated manner.

Objectives:

In this manuscript, we report on our effort to explore and evaluate the use of state-of-the-art LLMs to automate the selection phase in systematic literature studies.

Method:

We evaluated LLMs for the selection phase using two published systematic literature studies in software engineering as ground truth. Three prompts, based on the studies’ inclusion and exclusion criteria, were designed and applied with five LLMs to the studies’ titles and abstracts. Additionally, we analyzed combining two LLMs to replicate a practical selection phase. We analyzed recall and precision, and reflected upon the accuracy of the LLMs as well as upon whether the ground-truth studies had been conducted by early-career scholars or by more senior ones.
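To make the screening step concrete, the sketch below shows one way a title and abstract could be classified against inclusion and exclusion criteria with an LLM. This is an illustration only: the exact prompt wording, model interface, and decision format used by the authors are not given in the abstract, and ask_llm is a hypothetical hook standing in for any of the five evaluated models.

```python
# Minimal screening sketch (assumptions: prompt wording and ask_llm hook are illustrative,
# not the authors' actual prompts or API).
from typing import Callable

PROMPT_TEMPLATE = """You are screening studies for a systematic literature review.

Inclusion criteria:
{inclusion}

Exclusion criteria:
{exclusion}

Title: {title}
Abstract: {abstract}

Answer with a single word: INCLUDE or EXCLUDE."""


def screen_study(ask_llm: Callable[[str], str],
                 title: str, abstract: str,
                 inclusion: str, exclusion: str) -> bool:
    """Return True if the model votes to include the study based on its title and abstract."""
    prompt = PROMPT_TEMPLATE.format(inclusion=inclusion, exclusion=exclusion,
                                    title=title, abstract=abstract)
    answer = ask_llm(prompt).strip().upper()
    return answer.startswith("INCLUDE")
```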

Results:

Our results show a high average recall of up to 98% combined with a precision of 27% in a single-LLM approach, and an average recall of 99% with a precision of 27% in a two-model approach replicating a two-reviewer procedure. Further, the Llama 2 models showed the highest average recall (98%) across all prompt templates and datasets, while GPT4-turbo had the highest average precision (72%).
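For context, recall here is the share of the ground-truth included studies that the model also includes, and precision is the share of model-included studies that are actually relevant. A minimal sketch (not the authors' evaluation code) of how such figures are computed from per-study decisions:

```python
def recall_precision(predicted: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Compute recall and precision of a model's include decisions against
    the ground-truth inclusion set of the original review."""
    true_positives = len(predicted & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision

# Purely numeric illustration: keeping 99 of 100 relevant studies plus 270
# irrelevant ones yields recall 0.99 and precision of roughly 0.27.
```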

Conclusions:

Our results demonstrate how LLMs could support a selection phase in the future. We recommend a two-LLM approach to achieve a higher recall. However, we also critically reflect upon the further studies required, using other models and prompts on more datasets, to strengthen the confidence in our presented approach.
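The two-LLM procedure mirrors a two-reviewer screening step. The abstract does not spell out the combination rule; a recall-oriented union, where a study is kept whenever either model votes to include it, is one plausible reading, sketched below with hypothetical names.

```python
def combine_two_models(votes_a: dict[str, bool], votes_b: dict[str, bool]) -> set[str]:
    """Keep a study for full-text reading if either model votes to include it.
    This OR-rule favours recall over precision (an assumption; the abstract
    does not specify the authors' exact combination rule)."""
    all_ids = votes_a.keys() | votes_b.keys()
    return {sid for sid in all_ids
            if votes_a.get(sid, False) or votes_b.get(sid, False)}
```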
Journal: Information and Software Technology (Engineering Technology; Computer Science: Software Engineering)
CiteScore: 9.10
Self-citation rate: 7.70%
Articles published: 164
Review time: 9.6 weeks
Journal description: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.