HSDD: A hybrid sampling strategy for class imbalance in defect prediction data sets
M. Öztürk, A. Zengin
2016 Eleventh International Conference on Digital Information Management (ICDIM), 2016-08-01
DOI: 10.1109/FGCT.2016.7605093 (https://doi.org/10.1109/FGCT.2016.7605093)
Citations: 10
Abstract
Class imbalance is a common problem in defect prediction data sets. Over-sampling and under-sampling methods are typically employed to cope with it. However, these methods are designed for instance-based alteration and are not specialized for the feature space. Moreover, no existing approach is tailored specifically to class imbalance in defect prediction data sets. We develop HSDD (hybrid sampling for defect data sets) to address this gap. HSDD comprises not only the derivation of low-level metrics but also the reduction of repeated data points. The method was evaluated on industrial and open source project data sets using Bayes, naive Bayes, random forest, and J48 classifiers in terms of g-mean and training time. The results show that HSDD yields promising training performance, especially on large-scale data sets.
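For readers unfamiliar with the g-mean used in the evaluation, it is commonly defined as the geometric mean of sensitivity (recall on the defective class) and specificity (recall on the non-defective class). The following is a minimal sketch under that common definition, assuming binary labels where 1 marks a defective module. The function name `g_mean` and the sample labels are illustrative and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # defective module correctly flagged
            else:
                fn += 1  # defective module missed
        else:
            if pred == positive:
                fp += 1  # clean module wrongly flagged
            else:
                tn += 1  # clean module correctly passed
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced sample: 2 defective modules, 8 clean ones.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)

# A classifier that always predicts the majority class scores zero,
# even though its plain accuracy on this sample would be 0.8.
majority_only = g_mean(y_true, [0] * len(y_true))
```

Because the two recalls are multiplied, a model that ignores the minority class gets a g-mean of zero however high its accuracy, which is why the measure suits imbalanced defect data.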