{"title":"Impact of Data Sampling on Stability of Feature Selection for Software Measurement Data","authors":"Kehan Gao, T. Khoshgoftaar, Amri Napolitano","doi":"10.1109/ICTAI.2011.172","DOIUrl":null,"url":null,"abstract":"Software defect prediction can be considered a binary classification problem. Generally, practitioners utilize historical software data, including metric and fault data collected during the software development process, to build a classification model and then employ this model to predict new program modules as either fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results by (for example) assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often come with the modeling process: (1) high-dimensionality of software measurement data and (2) skewed or imbalanced distributions between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated towards improving the quality of training data. The commonly used techniques are feature selection and data sampling. Usually, researchers focus on evaluating classification performance after the training data is modified. The present study assesses a feature selection technique from a different perspective. We are more interested in studying the stability of a feature selection method, especially in understanding the impact of data sampling techniques on the stability of feature selection when using the sampled data. Some interesting findings are found based on two case studies performed on datasets from two real-world software projects.","PeriodicalId":332661,"journal":{"name":"2011 IEEE 23rd International Conference on Tools with Artificial Intelligence","volume":"147 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 23rd International Conference on Tools with Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI.2011.172","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 18
Abstract
Software defect prediction can be considered a binary classification problem. Generally, practitioners utilize historical software data, including metric and fault data collected during the software development process, to build a classification model, and then employ this model to predict whether new program modules are fault-prone (fp) or not-fault-prone (nfp). Limited project resources can then be allocated according to the prediction results, for example by assigning more reviews and testing to the modules predicted to be potentially defective. Two challenges often accompany the modeling process: (1) the high dimensionality of software measurement data and (2) the skewed or imbalanced distribution between the two types of modules (fp and nfp) in those datasets. To overcome these problems, extensive studies have been dedicated to improving the quality of training data. The most commonly used techniques are feature selection and data sampling. Researchers usually focus on evaluating classification performance after the training data has been modified. The present study assesses feature selection from a different perspective: we are interested in the stability of a feature selection method, and in particular in understanding the impact of data sampling techniques on that stability when feature selection is applied to the sampled data. Several interesting findings emerge from two case studies performed on datasets from two real-world software projects.
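To make the setup concrete, the following is a minimal sketch, not the authors' exact procedure, of how feature selection stability under data sampling might be measured: the data is repeatedly undersampled to balance the fp and nfp modules, a filter-based ranker selects the top-k metrics each time, and stability is summarized as the average pairwise Jaccard similarity of the selected subsets. The choice of ranker (ANOVA F-scores via scikit-learn's f_classif), the sampling technique (random undersampling), the stability measure, and all parameters below are illustrative assumptions; the abstract does not specify them.

```python
# Sketch of feature selection stability under data sampling (assumptions
# noted above: ranker, sampler, stability measure are all illustrative).
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

def undersample(X, y, rng):
    """Randomly undersample the majority (nfp) class to balance the data."""
    minority = np.flatnonzero(y == 1)   # fault-prone (fp) modules
    majority = np.flatnonzero(y == 0)   # not-fault-prone (nfp) modules
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]

def top_k_features(X, y, k):
    """Rank features by ANOVA F-score and keep the indices of the top k."""
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])

def stability(subsets):
    """Average pairwise Jaccard similarity over the selected feature subsets."""
    pairs = combinations(subsets, 2)
    return np.mean([len(a & b) / len(a | b) for a, b in pairs])

rng = np.random.default_rng(0)
# Imbalanced, high-dimensional toy data standing in for software metrics.
X, y = make_classification(n_samples=600, n_features=50, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

subsets = []
for _ in range(30):                      # 30 sampled variants of the data
    Xs, ys = undersample(X, y, rng)
    subsets.append(top_k_features(Xs, ys, k=10))

print(f"Stability of top-10 selection under sampling: {stability(subsets):.3f}")
```

A value near 1.0 would indicate that sampling barely perturbs which metrics are selected, while values well below 1.0 would signal that the sampling step itself is a meaningful source of instability, which is the effect the paper investigates.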