W. Zogaan, Palak Sharma, Mehdi Mirakhorli, V. Arnaoudova
{"title":"Datasets from Fifteen Years of Automated Requirements Traceability Research: Current State, Characteristics, and Quality","authors":"W. Zogaan, Palak Sharma, Mehdi Mirakhorli, V. Arnaoudova","doi":"10.1109/RE.2017.80","DOIUrl":null,"url":null,"abstract":"Software datasets play a crucial role in advancing automated software traceability research. They can be used by researchers in different ways to develop or validate new automated approaches. The diversity and quality of the datasets within a research community have a significant impact on the accuracy, generalizability, and reproducibility of the results and consequently on the usefulness and practicality of the techniques under study. Collecting and assessing the quality of such datasets are not trivial tasks and have been reported as an obstacle by many researchers in the domain of software engineering. This paper presents a first-of-its-kind study to review and assess the datasets that have been used in software traceability research over the last fifteen years. It presents and articulates the current status of these datasets, their characteristics, and their threats to validity. Furthermore, this paper introduces a Traceability-Dataset Quality Assessment (T-DQA) framework to categorize software traceability datasets and assist researchers to select appropriate datasets for their research based on different characteristics of the datasets and the context in which those datasets will be used.","PeriodicalId":176958,"journal":{"name":"2017 IEEE 25th International Requirements Engineering Conference (RE)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 25th International Requirements Engineering Conference (RE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/RE.2017.80","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21
Abstract
Software datasets play a crucial role in advancing automated software traceability research. They can be used by researchers in different ways to develop or validate new automated approaches. The diversity and quality of the datasets within a research community have a significant impact on the accuracy, generalizability, and reproducibility of the results and consequently on the usefulness and practicality of the techniques under study. Collecting and assessing the quality of such datasets are not trivial tasks and have been reported as an obstacle by many researchers in the domain of software engineering. This paper presents a first-of-its-kind study to review and assess the datasets that have been used in software traceability research over the last fifteen years. It presents and articulates the current status of these datasets, their characteristics, and their threats to validity. Furthermore, this paper introduces a Traceability-Dataset Quality Assessment (T-DQA) framework to categorize software traceability datasets and assist researchers to select appropriate datasets for their research based on different characteristics of the datasets and the context in which those datasets will be used.