Hybrid DE-SVM Approach for Feature Selection: Application to Gene Expression Datasets

2009 2nd International Symposium on Logistics and Industrial Informatics. Pub Date: 2009-09-25. DOI: 10.1109/LINDI.2009.5258761

J. García-Nieto, E. Alba, Javier Apolloni


Citations: 17

Abstract

The efficient selection of predictive and accurate gene subsets for cell-type classification is a crucial problem in Microarray data analysis. The application and combination of dedicated computational intelligence methods holds great promise for tackling feature selection and classification. In this work we present a Differential Evolution (DE) approach for efficient automated gene subset selection. In this model, the selected subsets are evaluated by their classification rate using a Support Vector Machine (SVM) classifier. The proposed approach is tested on the DLBCL Lymphoma and Colon Tumor gene expression datasets. Experiments covering effectiveness and biological analyses of the results, together with comparisons against related methods in the literature, indicate that our DE-SVM model is highly reliable and competitive.

I. INTRODUCTION

DNA Microarrays (MA) (13) allow scientists to analyze thousands of genes simultaneously, thus giving important insights into cell function, since changes in the physiology of an organism are generally associated with changes in the expression patterns of gene ensembles. The vast amount of data involved in a typical Microarray experiment usually requires a complex statistical analysis, with the important goal of classifying the dataset into the correct classes. The key issue in this classification is to identify significant and representative gene subsets that may be used to predict class membership for new external samples. Furthermore, these subsets should be as small as possible in order to develop fast, low-cost processes for future class prediction.

The main difficulty in Microarray classification compared with other domains is the relatively small number of samples available in comparison with the number of genes in each sample (between 2,000 and more than 10,000 in MA). In addition, expression data are highly redundant and noisy, and most genes are believed to be uninformative with respect to the studied classes, as only a fraction of genes may present distinct profiles for different classes of samples. In this context, machine learning techniques have been applied to handle large and heterogeneous datasets, since they are capable of isolating the useful information by rejecting redundancies. Concretely, feature selection is often considered a necessary preprocessing step for analyzing large datasets, as it can reduce the dimensionality of the datasets and often leads to better analyses (9). Feature selection (gene selection in Biology) for gene expression analysis in cancer prediction often uses wrapper classification methods to discriminate a type of tumor (9), (11), to reduce the number of genes to investigate for a new patient, and also to assist in drug discovery and early diagnosis.

The formal definition of the feature selection problem that we consider here is given as follows:
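
The definition itself is not reproduced here. As a rough illustration of the wrapper scheme the abstract describes (a DE search whose candidate gene subsets are scored by classification rate), the following Python sketch may help. It is not the authors' implementation, and several pieces are assumptions made only for this example: a leave-one-out nearest-centroid classifier stands in for the SVM so that the code needs nothing beyond the standard library; real-valued DE vectors are turned into gene masks by thresholding at 0.5; and the names `loo_accuracy` and `de_select`, the small subset-size penalty, the parameter values, and the toy data are all hypothetical.

```python
import random


def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)
    restricted to the features where mask is True."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    hits = 0
    for i in range(len(X)):
        # Per-class feature sums and counts, computed without sample i.
        sums, counts = {}, {}
        for k, (row, label) in enumerate(zip(X, y)):
            if k == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += row[j]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared distance from sample i to the centroid of `label`.
            return sum((X[i][j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        hits += min(sums, key=dist) == y[i]
    return hits / len(X)


def de_select(X, y, pop_size=10, generations=20, f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin over vectors in [0, 1]^n; values > 0.5 select a feature."""
    rng = random.Random(seed)
    n = len(X[0])

    def decode(v):
        return [x > 0.5 for x in v]

    def score(v):
        mask = decode(v)
        # Accuracy first; an assumed small penalty favours smaller subsets.
        return loo_accuracy(X, y, mask) - 0.01 * sum(mask) / n

    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    fit = [score(v) for v in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(n)  # guarantees at least one mutated gene
            trial = []
            for j in range(n):
                if rng.random() < cr or j == j_rand:
                    val = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, val)))  # clip to [0, 1]
                else:
                    trial.append(pop[i][j])
            s = score(trial)
            if s >= fit[i]:  # greedy one-to-one replacement
                pop[i], fit[i] = trial, s
    best = max(range(pop_size), key=fit.__getitem__)
    return decode(pop[best]), fit[best]


# Toy data (hypothetical): feature 0 separates the classes, features 1-2 do not.
X_toy = [
    [0.0, 0.1, 0.9], [0.1, 0.9, 0.1], [0.2, 0.5, 0.5], [0.3, 0.3, 0.7],
    [10.0, 0.1, 0.9], [10.1, 0.9, 0.1], [10.2, 0.5, 0.5], [10.3, 0.3, 0.7],
]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]
best_mask, best_score = de_select(X_toy, y_toy)
print(best_mask, round(best_score, 4))
```

To move closer to the setting described in the abstract, only the scoring function would need to change, for example to a cross-validated SVM on the real expression matrix; the DE loop itself is independent of the classifier.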

