Filtering, Robust Filtering, Polishing: Techniques for Addressing Quality in Software Data

G. Liebchen, Bhekisipho Twala, M. Shepperd, M. Cartwright, Mark Stephens
First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007)
DOI: 10.1109/ESEM.2007.70 (https://doi.org/10.1109/ESEM.2007.70)
Citations: 27

Abstract

Data quality is an important aspect of empirical analysis. This paper compares three noise handling methods to assess the benefit of identifying and either filtering or editing problematic instances. We compare a 'do nothing' strategy with (i) filtering, (ii) robust filtering and (iii) filtering followed by polishing. A problem is that it is not possible to determine whether an instance contains noise unless it has implausible values. Since we cannot determine the true overall noise level, we use implausible values as a proxy measure. In addition to the ability to identify implausible values, we use another proxy measure: the ability to fit a classification tree to the data. The interpretation is that low misclassification rates imply low noise levels. We found that all three of our data quality techniques improve upon the 'do nothing' strategy, and that filtering followed by polishing was the most effective technique for dealing with noise, since it eliminated the fewest data and had the lowest misclassification rates. Unfortunately, the polishing process introduces new implausible values. We believe consideration of data quality is an important aspect of empirical software engineering. We have shown that, for one large and complex real-world data set, automated techniques can help isolate noisy instances and potentially polish the values to produce better quality data for the analyst. However, this work is at a preliminary stage, and it assumes that the proxy measures of quality are appropriate.
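The abstract does not spell out the procedures, so the following is only a minimal sketch of the general idea behind classifier-based filtering and polishing, under stated assumptions. A one-level decision stump stands in for the classification tree, instances misclassified under cross-validation are treated as suspected noise, filtering drops them, and polishing relabels them with the prediction of a model fitted to the filtered data. The toy data, the function names (`fit_stump`, `flag_noisy`, `filter_rows`, `polish_rows`, `error_rate`) and the choice of three folds are all hypothetical and not taken from the paper. Robust filtering is not sketched.

```python
from typing import List, Tuple

Row = Tuple[float, int]            # (feature value, class label in {0, 1})
Model = Tuple[float, int, int]     # (threshold, label if x <= threshold, label otherwise)


def fit_stump(rows: List[Row]) -> Model:
    """Fit a one-level tree; assumes at least two distinct feature values."""
    values = sorted({x for x, _ in rows})
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return t, left, right


def predict(model: Model, x: float) -> int:
    t, left, right = model
    return left if x <= t else right


def flag_noisy(rows: List[Row], folds: int = 3) -> List[int]:
    """Indices misclassified by a stump trained on the other folds."""
    flagged = []
    for k in range(folds):
        train = [r for i, r in enumerate(rows) if i % folds != k]
        model = fit_stump(train)
        flagged += [i for i, (x, y) in enumerate(rows)
                    if i % folds == k and predict(model, x) != y]
    return sorted(flagged)


def filter_rows(rows: List[Row], flagged: List[int]) -> List[Row]:
    """Filtering: discard the suspected noisy instances."""
    drop = set(flagged)
    return [r for i, r in enumerate(rows) if i not in drop]


def polish_rows(rows: List[Row], flagged: List[int]) -> List[Row]:
    """Polishing (class labels only): relabel flagged instances instead of dropping them."""
    model = fit_stump(filter_rows(rows, flagged))
    fix = set(flagged)
    return [(x, predict(model, x)) if i in fix else (x, y)
            for i, (x, y) in enumerate(rows)]


def error_rate(rows: List[Row]) -> float:
    """Proxy for noise: misclassification rate of a stump fitted to the data."""
    model = fit_stump(rows)
    return sum(predict(model, x) != y for x, y in rows) / len(rows)


# Toy data: class 0 for small x, class 1 for large x, with one mislabeled row (x = 3).
rows = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 0),
        (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1)]

flagged = flag_noisy(rows)
filtered = filter_rows(rows, flagged)
polished = polish_rows(rows, flagged)
print(flagged, len(filtered), error_rate(rows), error_rate(filtered), error_rate(polished))
```

In this sketch, filtering shrinks the data set while polishing keeps every instance, which mirrors the trade-off the abstract describes. The sketch only edits class labels, so it cannot reproduce the paper's observation that polishing may introduce new implausible values.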