Impact of missing value imputation on classification for DNA microarray gene expression data--a model-based study.

Youting Sun, Ulisses Braga-Neto, Edward R Dougherty
{"title":"Impact of missing value imputation on classification for DNA microarray gene expression data--a model-based study.","authors":"Youting Sun,&nbsp;Ulisses Braga-Neto,&nbsp;Edward R Dougherty","doi":"10.1155/2009/504069","DOIUrl":null,"url":null,"abstract":"<p><p>Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.</p>","PeriodicalId":72957,"journal":{"name":"EURASIP journal on bioinformatics & systems biology","volume":"2009 ","pages":"504069"},"PeriodicalIF":0.0000,"publicationDate":"2009-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1155/2009/504069","citationCount":"22","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"EURASIP journal on bioinformatics & systems biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2009/504069","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2010/3/2 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 22

Abstract

Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.

Abstract Image

Abstract Image

Abstract Image

缺失值输入对DNA微阵列基因表达数据分类的影响——基于模型的研究。
目前,针对微阵列数据已经开发了许多缺失值(MV)输入方法,但仅有少数研究探讨了缺失值输入与分类精度之间的关系。此外,这些研究在MV生成和分类器误差估计等基本步骤上存在问题。在这项工作中,我们开展了一项基于模型的研究,解决了以前研究中的一些问题。考虑了六种常用的插值算法、两种特征选择方法和三种分类规则。结果表明,当噪声水平高、方差小或基因簇相关性强时,在小到中等的MV率下,应用MV归算是有利的。在这些情况下,如果数据质量指标可用,那么将质量差的数据点视为缺失的数据点,并应用最健壮的输入算法之一,以基于可用的高质量数据点估计真实信号,可能会有所帮助。然而,在较大的毫伏率下,我们得出的结论是,不建议采用归算方法。在MV率方面,我们的结果表明存在峰值现象:随着MV率的增加,插补方法的性能实际上最初有所提高,但在最佳点之后,随着MV率的增加,性能迅速恶化。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信