Impact of missing value imputation on classification for DNA microarray gene expression data: a model-based study.

EURASIP Journal on Bioinformatics & Systems Biology. Pub Date: 2009-01-01. Epub Date: 2010-03-02. DOI: 10.1155/2009/504069

Youting Sun, Ulisses Braga-Neto, Edward R Dougherty


Citations: 22

Abstract

Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.
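The pipeline the abstract describes (generate data from a model, introduce missing values, impute, train a classifier, estimate its error) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-class Gaussian toy model, the function names, the simplified sample-wise k-nearest-neighbour imputation, the nearest-centroid classifier, and the hold-out error estimate on complete test data. It is not one of the six imputation algorithms or three classification rules studied by the authors, and feature selection is omitted.

```python
import random


def make_data(n_samples, n_genes, shift, noise_sd, rng):
    """Toy two-class model (assumed): class 1 is shifted in the first 5 genes."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([rng.gauss(shift * label if g < 5 else 0.0, noise_sd)
                  for g in range(n_genes)])
        y.append(label)
    return X, y


def inject_missing(X, rate, rng):
    """Copy X, replacing each entry by None independently with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in X]


def _distance(a, b):
    """Mean squared difference over entries observed in both rows."""
    shared = [(u - v) ** 2 for u, v in zip(a, b)
              if u is not None and v is not None]
    return sum(shared) / len(shared) if shared else float("inf")


def knn_impute(X, k=5):
    """Fill each missing entry with the mean of that gene over the k nearest
    samples that have it observed; fall back to the gene mean (or 0.0)."""
    n_genes = len(X[0])
    gene_means = []
    for g in range(n_genes):
        seen = [row[g] for row in X if row[g] is not None]
        gene_means.append(sum(seen) / len(seen) if seen else 0.0)
    filled = [row[:] for row in X]
    for i, row in enumerate(X):
        missing = [g for g, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Rank all other samples by distance to the incomplete sample.
        ranked = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: _distance(row, X[j]))
        for g in missing:
            donors = [X[j][g] for j in ranked if X[j][g] is not None][:k]
            filled[i][g] = sum(donors) / len(donors) if donors else gene_means[g]
    return filled


def centroid_error(X_train, y_train, X_test, y_test):
    """Hold-out error of a nearest-centroid classifier."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    wrong = 0
    for x, t in zip(X_test, y_test):
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))
        wrong += pred != t
    return wrong / len(X_test)


def run(rate, seed=0):
    """Impute training data at the given MV rate and return the test error."""
    rng = random.Random(seed)
    X_train, y_train = make_data(40, 20, shift=2.0, noise_sd=1.0, rng=rng)
    X_test, y_test = make_data(200, 20, shift=2.0, noise_sd=1.0, rng=rng)
    X_filled = knn_impute(inject_missing(X_train, rate, rng))
    return centroid_error(X_filled, y_train, X_test, y_test)


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.3):
        print(rate, run(rate))
```

Because the data come from a known model, the error can be measured on a large independent test sample instead of being estimated by resampling the small training set. Sweeping `rate`, `noise_sd`, and `shift` in this toy setting is one way to explore how classification error responds to the MV rate and noise level, though it should not be expected to reproduce the paper's findings.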




Source journal

EURASIP journal on bioinformatics & systems biology

Self-citation rate

0.00%

Publication count