A New Feature Selection Method for Enhancing Cancer Diagnosis Based on DNA Microarray
Mostafa Atlam, Hanaa Torkey, Hanaa Salem, N. El-Fishawy
2020 37th National Radio Science Conference (NRSC), pp. 285-295, published 2020-09-08. DOI: 10.1109/NRSC49500.2020.9235095
Citations: 7
Abstract
Accurately classifying medical data is critical for improving diagnostic prediction systems and identifying therapeutic targets for treatment. A major challenge in analysing gene expression data is extracting disease-related genes from the large number of genes produced by next-generation sequencing technology. Eliminating irrelevant and redundant genes is therefore a key step in preparing the data for prediction. Our objective is to predict more accurately the presence of cancer in a sample cell from its gene expression. In this paper, we introduce Classification Technique as Feature Selection (CTFS), a new feature selection (FS) method that extracts a small subset of genes from the large set of expressed genes to improve cancer prediction. The classification techniques enrolled in the CTFS function for selection are K-Nearest Neighbors (K-NN) and Extreme Gradient Boosting (XGBoost) optimized by Bayesian Parameter Tuning (BPT). The feature selection methods used to benchmark the CTFS function are Univariate Feature Selection (UFS) and Feature Importance (FI). The classification stage follows the feature selection stage and uses three machine learning (ML) algorithms: Naïve Bayes (NB), Linear Support Vector Machine (LSVM), and Random Forest (RF). Results show that using XGBoost optimized by BPT for FS outperforms the FI method, yielding higher prediction accuracies with a minimal number of features, though at a higher running time. K-NN as an FS method outperforms all other FS methods in terms of accuracy, reaching up to 100% when combined with LSVM on the simulation dataset.
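The abstract does not spell out the exact CTFS procedure, but the overall pipeline it describes (tune a classifier on the full gene set, use it to select a small gene subset, then train a second classifier on that subset) can be sketched as follows. This is a minimal illustration on synthetic data using scikit-learn and xgboost; the random hyperparameter search stands in for the paper's Bayesian Parameter Tuning, and the feature counts, parameter ranges, and variable names (e.g. top_genes) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a "classification technique as feature selection" pipeline.
# Synthetic data stands in for a DNA-microarray gene-expression matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Many features, few informative ones -- roughly the shape of microarray data.
X, y = make_classification(n_samples=200, n_features=1000, n_informative=20,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Feature-selection stage: tune an XGBoost classifier (random search here as a
# stand-in for the Bayesian Parameter Tuning used in the paper), then keep the
# genes with the highest learned importances.
search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_distributions={"max_depth": [2, 3, 4, 6],
                         "n_estimators": [50, 100, 200],
                         "learning_rate": [0.01, 0.1, 0.3]},
    n_iter=10, cv=3, random_state=0)
search.fit(X_train, y_train)
importances = search.best_estimator_.feature_importances_
top_genes = np.argsort(importances)[::-1][:20]   # small selected gene subset

# Classification stage: a linear SVM trained only on the selected genes.
clf = LinearSVC(max_iter=5000)
clf.fit(X_train[:, top_genes], y_train)
print("accuracy on held-out data:",
      accuracy_score(y_test, clf.predict(X_test[:, top_genes])))
```

The same two-stage structure applies when K-NN drives the selection or when NB or RF replaces the linear SVM in the classification stage; only the estimators plugged into each stage change.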