Noise Correction using Bayesian Multiple Imputation

2006 IEEE International Conference on Information Reuse & Integration. Pub Date: 2006-12-04. DOI: 10.1109/IRI.2006.252461

J. V. Hulse, T. Khoshgoftaar, Chris Seiffert, Lili Zhao


Citations: 7

Abstract

This work presents a novel procedure to detect and correct noise in a continuous dependent variable. Noise in a dataset poses a significant challenge to data mining algorithms, because incorrect values in the independent or dependent variables can severely corrupt the results of even robust learners. The problem is especially severe when the noise is located in the dependent variable. In the worst case, severe noise in one of the independent variables can be handled by eliminating that attribute from the dataset, provided that the practitioner knows the noise is present. In supervised learning, however, the dependent variable is the most critical attribute in the dataset and cannot be eliminated even if significant noise is present. Noise handling procedures for the dependent variable are therefore critical to the success of a supervised learning initiative. Compared with a binary dependent variable or class, noise in a continuous dependent variable presents many additional difficulties. Our procedure to detect and correct noise in a continuous dependent variable uses Bayesian multiple imputation, which was initially developed to combat the problem of missing data. Our case study considers a real-world software measurement dataset called CCCS, which has a numeric dependent variable with inherent noise. The experimental results are very encouraging and clearly demonstrate the utility of our procedure.
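The abstract does not give the details of the procedure, so the following is only a minimal sketch of how multiple imputation could be turned toward noise correction, not the authors' method. It assumes a Bayesian linear regression with a noninformative prior as the imputation model, treats every observed dependent value as if it were missing, and flags a value as noisy when it lies more than `k` imputation standard deviations from the mean of its `m` imputed values. The function name `correct_noise_mi`, the parameters `m` and `k`, the threshold rule, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def correct_noise_mi(X, y, m=20, k=3.0, seed=0):
    """Flag and replace suspicious y values using m posterior-predictive draws.

    Assumed imputation model: linear regression with a noninformative prior.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    p = A.shape[1]
    XtX_inv = np.linalg.inv(A.T @ A)
    beta_hat = XtX_inv @ A.T @ y              # least-squares estimate
    rss = np.sum((y - A @ beta_hat) ** 2)     # residual sum of squares

    draws = np.empty((m, n))
    for j in range(m):
        # Draw the variance, then the coefficients, from their posteriors.
        sigma2 = rss / rng.chisquare(n - p)
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        # Impute every y as if it were missing.
        draws[j] = A @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

    imputed_mean = draws.mean(axis=0)
    imputed_sd = draws.std(axis=0, ddof=1)
    # Hypothetical rule: an observed value far from its imputations is noisy.
    flagged = np.abs(y - imputed_mean) > k * imputed_sd
    corrected = np.where(flagged, imputed_mean, y)
    return flagged, corrected

# Synthetic illustration (not the CCCS dataset): corrupt one value.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
clean = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)
noisy = clean.copy()
noisy[0] += 25.0
flagged, corrected = correct_noise_mi(x, noisy)
```

In this sketch the imputation model is fitted on data that still contains the noisy values, which inflates the estimated variance; a more careful variant might refit after removing flagged instances.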

