Supporting Systematic Literature Reviews Using Deep-Learning-Based Language Models

2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE), Pub Date: 2022-05-01, DOI: 10.1145/3528588.3528658

Rand Alchokr, M. Borkar, Sharanya Thotadarya, G. Saake, Thomas Leich


Citations: 1

Abstract

Background: Systematic Literature Reviews (SLRs) are an important research method for gathering and evaluating the available evidence on a specific research topic. However, conducting an SLR manually can be difficult and time-consuming. For this reason, researchers aim to semi-automate the process or some of its phases.

Aim: We aimed to use a deep-learning-based technique that clusters contextualized embeddings, combining transformer-based language models with a weighting scheme, to accelerate the conducting phase of SLRs by efficiently scanning the initial set of retrieved publications.

Method: We performed an experiment using two manually conducted SLRs to evaluate the performance of two deep-learning-based clustering models. These models build on transformer-based deep language models (i.e., BERT and S-BERT) to extract contextualized embeddings at different text levels, and use a weighting scheme to cluster similar publications.

Results: Our primary results show that clustering on paragraph-level embeddings (the S-BERT-paragraph setting) is the best-performing model setting across the criteria considered: correctly identifying primary studies, the number of additional documents assigned to the relevant cluster, and the execution time of the experiments.

Conclusions: The findings indicate that using natural-language-based deep-learning architectures to semi-automate the selection of primary studies can accelerate the scanning and identification process. While our results represent first insights only, such a technique seems to enhance the SLR process, promising to help researchers identify the most relevant publications more quickly and efficiently.
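The general idea of weighting embeddings from different text levels and then clustering publications can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for an encoder such as S-BERT, the weights 0.3 and 0.7 are illustrative assumptions, the two-cluster cosine k-means and all function names are hypothetical, and the four example documents are made up.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy stand-in for a contextual encoder such as S-BERT (assumption):
    L2-normalised word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def weighted_embedding(doc, vocab, w_title=0.3, w_abstract=0.7):
    """Combine per-level embeddings; the weights are illustrative, not published values."""
    t = embed(doc["title"], vocab)
    a = embed(doc["abstract"], vocab)
    return [w_title * x + w_abstract * y for x, y in zip(t, a)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_two(vectors, iterations=10):
    """Two-cluster k-means variant on cosine similarity with deterministic init:
    the first document and the document least similar to it seed the centroids."""
    far = min(range(len(vectors)), key=lambda i: cosine(vectors[0], vectors[i]))
    centroids = [vectors[0], vectors[far]]
    labels = []
    for _ in range(iterations):
        # Assign each document to the most similar centroid.
        labels = [max((0, 1), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in (0, 1):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up example publications (not from the paper's data sets).
docs = [
    {"title": "Feature model analysis",
     "abstract": "automated analysis of feature models in software product lines"},
    {"title": "Variability in product lines",
     "abstract": "feature models describe variability in software product lines"},
    {"title": "Deep learning for images",
     "abstract": "convolutional networks classify images with deep learning"},
    {"title": "Image segmentation networks",
     "abstract": "deep networks segment images"},
]

vocab = sorted({w for d in docs for w in (d["title"] + " " + d["abstract"]).lower().split()})
vectors = [weighted_embedding(d, vocab) for d in docs]
labels = cluster_two(vectors)

# Treat the cluster containing a known primary study (index 0, hypothetical) as relevant.
candidates = [i for i, l in enumerate(labels) if l == labels[0]]
print(labels, candidates)
```

In a realistic setting, `embed` would be replaced by a real sentence encoder and the cluster count and weights would need tuning; the sketch only shows how the weighting and clustering steps fit together.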



