Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data

First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007) Pub Date: 1900-01-01 DOI: 10.1109/ESEM.2007.70

G. Liebchen, Bhekisipho Twala, M. Shepperd, M. Cartwright, Mark Stephens


Citations: 27

Abstract

Data quality is an important aspect of empirical analysis. This paper compares three noise handling methods to assess the benefit of identifying and either filtering or editing problematic instances. We compare a 'do nothing' strategy with (i) filtering, (ii) robust filtering and (iii) filtering followed by polishing. A problem is that it is not possible to determine whether an instance contains noise unless it has implausible values. Since we cannot determine the true overall noise level, we use implausible values as a proxy measure. In addition to the ability to identify implausible values, we use another proxy measure: the ability to fit a classification tree to the data. The interpretation is that low misclassification rates imply low noise levels. We found that all three of our data quality techniques improve upon the 'do nothing' strategy, and that filtering followed by polishing was the most effective technique for dealing with noise, since it eliminated the fewest data and had the lowest misclassification rates. Unfortunately, the polishing process introduces new implausible values. We believe consideration of data quality is an important aspect of empirical software engineering. We have shown that, for one large and complex real-world data set, automated techniques can help isolate noisy instances and potentially polish the values to produce better quality data for the analyst. However, this work is at a preliminary stage, and it assumes that the proxy measures of quality are appropriate.
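
To make the difference between filtering and polishing concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a suspect instance is one misclassified by a model trained on the other cross-validation folds, it uses a one-split decision stump as a stand-in for the classification tree, and it polishes only the class label. The toy data, the function names (`fit_stump`, `flag_suspects`, `filter_rows`, `polish_rows`, `misclassified`) and the fold count are all hypothetical choices made for illustration.

```python
from collections import Counter


def fit_stump(rows):
    """Fit a one-split stump (assumed stand-in for a classification tree).

    rows is a list of (x, label) pairs with numeric x.
    Returns (threshold, left_label, right_label); threshold is None
    when no split beats predicting the majority label everywhere.
    """
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    best = (None, majority, majority)
    best_errors = sum(1 for _, label in rows if label != majority)
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate split between two adjacent values
        left = Counter(l for x, l in rows if x <= t).most_common(1)[0][0]
        right = Counter(l for x, l in rows if x > t).most_common(1)[0][0]
        errors = sum(1 for x, l in rows if (left if x <= t else right) != l)
        if errors < best_errors:
            best, best_errors = (t, left, right), errors
    return best


def predict(stump, x):
    t, left, right = stump
    return left if t is None or x <= t else right


def misclassified(rows):
    """Proxy noise measure: training errors of a stump fitted to rows."""
    stump = fit_stump(rows)
    return sum(1 for x, label in rows if predict(stump, x) != label)


def flag_suspects(rows, folds=3):
    """Indices misclassified by a stump trained on the other folds."""
    suspects = []
    for f in range(folds):
        train = [r for j, r in enumerate(rows) if j % folds != f]
        stump = fit_stump(train)
        suspects += [j for j, (x, label) in enumerate(rows)
                     if j % folds == f and predict(stump, x) != label]
    return sorted(suspects)


def filter_rows(rows, suspects):
    """Filtering: drop the suspect instances."""
    drop = set(suspects)
    return [r for j, r in enumerate(rows) if j not in drop]


def polish_rows(rows, suspects):
    """Filtering followed by polishing: relabel suspects, keep every row."""
    stump = fit_stump(filter_rows(rows, suspects))
    fix = set(suspects)
    return [(x, predict(stump, x)) if j in fix else (x, label)
            for j, (x, label) in enumerate(rows)]


# Hypothetical toy data: project size -> class, with one mislabeled instance.
rows = [(1, "small"), (2, "small"), (3, "large"), (4, "small"),
        (5, "small"), (6, "small"), (11, "large"), (12, "large"),
        (13, "large"), (14, "large"), (15, "large"), (16, "large")]

suspects = flag_suspects(rows)
filtered = filter_rows(rows, suspects)
polished = polish_rows(rows, suspects)

for name, data in [("do nothing", rows), ("filtered", filtered),
                   ("polished", polished)]:
    print(name, "instances:", len(data), "misclassified:", misclassified(data))
```

On this toy data, filtering discards the suspect instance while polishing keeps it with an edited label, which is the trade-off the abstract describes between data loss and misclassification rate. A polished value is a model prediction rather than an observation, so it can itself be wrong.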
