{"title":"Hybrid DE-SVM Approach for Feature Selection: Application to Gene Expression Datasets","authors":"J. García-Nieto, E. Alba, Javier Apolloni","doi":"10.1109/LINDI.2009.5258761","DOIUrl":null,"url":null,"abstract":"The efficient selection of predictive and accurate gene subsets for cell-type classification is a crucial problem in Microarray data analysis. Applying and combining dedicated computational intelligence methods holds great promise for tackling feature selection and classification. In this work we present a Differential Evolution (DE) approach for efficient automated gene subset selection. In this model, the selected subsets are evaluated by their classification rate using a Support Vector Machine (SVM) classifier. The proposed approach is tested on the DLBCL Lymphoma and Colon Tumor gene expression datasets. Experiments covering effectiveness and biological analyses of the results, together with comparisons against related methods in the literature, indicate that our DE-SVM model is highly reliable and competitive. I. INTRODUCTION DNA Microarrays (MA) (13) allow scientists to analyze thousands of genes simultaneously, thus giving important insights into cell function, since changes in the physiology of an organism are generally associated with changes in the expression patterns of gene ensembles. The vast amount of data involved in a typical Microarray experiment usually requires a complex statistical analysis, with the important goal of classifying the dataset into the correct classes. The key issue in this classification is to identify significant and representative gene subsets that may be used to predict class membership for new external samples. Furthermore, these subsets should be as small as possible in order to develop fast processes with low resource consumption for future class prediction. 
The main difficulty in Microarray classification compared with other domains is the relatively small number of samples available in comparison with the number of genes in each sample (between 2,000 and more than 10,000 in MA). In addition, expression data are highly redundant and noisy, and most genes are believed to be uninformative with respect to the studied classes, as only a fraction of genes may present distinct profiles for different classes of samples. In this context, machine learning techniques have been applied to handle large and heterogeneous datasets, since they are capable of isolating the useful information by rejecting redundancies. Concretely, feature selection is often considered a necessary preprocessing step for analyzing large datasets, as it can reduce the dimensionality of the datasets and often leads to better analyses (9). Feature selection (gene selection in Biology) for gene expression analysis in cancer prediction often uses wrapper classification methods to discriminate a type of tumor (9), (11), to reduce the number of genes to investigate for a new patient, and also to assist in drug discovery and early diagnosis. 
The formal definition of the feature selection problem that we consider here is given as follows:","PeriodicalId":306564,"journal":{"name":"2009 2nd International Symposium on Logistics and Industrial Informatics","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 2nd International Symposium on Logistics and Industrial Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/LINDI.2009.5258761","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 17
Abstract
The efficient selection of predictive and accurate gene subsets for cell-type classification is a crucial problem in Microarray data analysis. Applying and combining dedicated computational intelligence methods holds great promise for tackling feature selection and classification. In this work we present a Differential Evolution (DE) approach for efficient automated gene subset selection. In this model, the selected subsets are evaluated by their classification rate using a Support Vector Machine (SVM) classifier. The proposed approach is tested on the DLBCL Lymphoma and Colon Tumor gene expression datasets. Experiments covering effectiveness and biological analyses of the results, together with comparisons against related methods in the literature, indicate that our DE-SVM model is highly reliable and competitive.
I. INTRODUCTION
DNA Microarrays (MA) (13) allow scientists to analyze thousands of genes simultaneously, thus giving important insights into cell function, since changes in the physiology of an organism are generally associated with changes in the expression patterns of gene ensembles. The vast amount of data involved in a typical Microarray experiment usually requires a complex statistical analysis, with the important goal of classifying the dataset into the correct classes. The key issue in this classification is to identify significant and representative gene subsets that may be used to predict class membership for new external samples. Furthermore, these subsets should be as small as possible in order to develop fast processes with low resource consumption for future class prediction.
The main difficulty in Microarray classification compared with other domains is the relatively small number of samples available in comparison with the number of genes in each sample (between 2,000 and more than 10,000 in MA). In addition, expression data are highly redundant and noisy, and most genes are believed to be uninformative with respect to the studied classes, as only a fraction of genes may present distinct profiles for different classes of samples. In this context, machine learning techniques have been applied to handle large and heterogeneous datasets, since they are capable of isolating the useful information by rejecting redundancies. Concretely, feature selection is often considered a necessary preprocessing step for analyzing large datasets, as it can reduce the dimensionality of the datasets and often leads to better analyses (9). Feature selection (gene selection in Biology) for gene expression analysis in cancer prediction often uses wrapper classification methods to discriminate a type of tumor (9), (11), to reduce the number of genes to investigate for a new patient, and also to assist in drug discovery and early diagnosis.
The formal definition of the feature selection problem that we consider here is given as follows: