G. Liebchen, Bhekisipho Twala, M. Shepperd, M. Cartwright, Mark Stephens
{"title":"Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data","authors":"G. Liebchen, Bhekisipho Twala, M. Shepperd, M. Cartwright, Mark Stephens","doi":"10.1109/ESEM.2007.70","DOIUrl":null,"url":null,"abstract":"Data quality is an important aspect of empirical analysis. This paper compares three noise handling methods to assess the benefit of identifying and either filtering or editing problematic instances. We compare a 'do nothing' strategy with (i) filtering, (ii) robust filtering and (Hi) filtering followed by polishing. A problem is that it is not possible to determine whether an instance contains noise unless it has implausible values. Since we cannot determine the true overall noise level we use implausible val.ues as a proxy measure. In addition to the ability to identify implausible values, we use another proxy measure, the ability to fit a classification tree to the data. The interpretation is low misclassification rates imply low noise levels. We found that all three of our data quality techniques improve upon the 'do nothing' strategy, also that the filtering and polishing was the most effective technique for dealing with noise since we eliminated the fewest data and had the lowest misclassification rates. Unfortunately the polishing process introduces new implausible values. We believe consideration of data quality is an important aspect of empirical software engineering. We have shown that for one large and complex real world data set automated techniques can help isolate noisy instances and potentially polish the values to produce better quality data for the analyst. However this work is at a preliminary stage and it assumes that the proxy measures of lity are appropriate.","PeriodicalId":124420,"journal":{"name":"First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"27","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESEM.2007.70","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 27
Abstract
Data quality is an important aspect of empirical analysis. This paper compares three noise handling methods to assess the benefit of identifying and either filtering or editing problematic instances. We compare a 'do nothing' strategy with (i) filtering, (ii) robust filtering and (Hi) filtering followed by polishing. A problem is that it is not possible to determine whether an instance contains noise unless it has implausible values. Since we cannot determine the true overall noise level we use implausible val.ues as a proxy measure. In addition to the ability to identify implausible values, we use another proxy measure, the ability to fit a classification tree to the data. The interpretation is low misclassification rates imply low noise levels. We found that all three of our data quality techniques improve upon the 'do nothing' strategy, also that the filtering and polishing was the most effective technique for dealing with noise since we eliminated the fewest data and had the lowest misclassification rates. Unfortunately the polishing process introduces new implausible values. We believe consideration of data quality is an important aspect of empirical software engineering. We have shown that for one large and complex real world data set automated techniques can help isolate noisy instances and potentially polish the values to produce better quality data for the analyst. However this work is at a preliminary stage and it assumes that the proxy measures of lity are appropriate.