Machine Learning Model for Cancer Diagnosis based on RNAseq Microarray

Hanaa Torkey, Mostafa Atlam, N. El-Fishawy, Hanaa Salem
Journal: Menoufia Journal of Electronic Engineering Research
Volume/issue (as listed): 26 1
Publication date: 2020-03-18
Publication type: Journal Article
DOI: 10.21608/mjeer.2020.20533.1000 (https://doi.org/10.21608/mjeer.2020.20533.1000)
Citations: 6

Abstract

Microarray technology is one of the most important recent breakthroughs in experimental molecular biology. It allows the expression levels of thousands of genes in cells to be monitored concurrently, and it has been increasingly used in cancer research to better understand the molecular variations among tumors so that more reliable classification becomes attainable. Machine learning techniques are widely used to build robust and precise classification models. In this paper, a function called Feature Reduction Classification Optimization (FeRCO) is proposed. The FeRCO function applies machine learning techniques to RNAseq microarray data to predict whether a patient is diseased or not. Its main purpose is to determine the minimum number of features, using the best-fitting reduction technique together with the classification technique, that gives the highest classification accuracy. The classification techniques include Support Vector Machine (SVM), both linear and kernel, Decision Trees (DT), Random Forest (RF), K-Nearest Neighbours (KNN) and Naïve Bayes (NB). Principal Component Analysis (PCA), both linear and kernel, Linear Discriminant Analysis (LDA) and Factor Analysis (FA) were used along with the different machine learning techniques to find a lower-dimensional subspace with more discriminative features for better classification. The major outcomes of this research can serve as a roadmap for researchers interested in this field, helping them choose the most suitable machine learning algorithm, whether for classification or for reduction. The results show that FA and LPCA are the best reduction techniques across the three datasets, providing accuracy of up to 100% on the TCGA and simulation datasets and up to 97.86% on the WDBC dataset. LSVM is the best classification technique to use with Linear PCA (LPCA), FA and LDA. RF is the best classification technique to use with Kernel PCA (KPCA).
Keywords: Cancer Classification, Diagnosis, Gene Expression, Gene Reduction, Machine Learning.
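
The search described in the abstract (pair each reduction technique with each classifier and keep the combination that reaches the highest accuracy with the fewest features) can be illustrated with a minimal Python sketch. This is not the authors' FeRCO implementation, which the abstract does not detail. It assumes scikit-learn is available, and the function name ferco_sketch, the single stratified hold-out split, the RBF kernel for KPCA and kernel SVM, the hyperparameters, and the tie-breaking rule (fewest components among equally accurate models) are all assumptions made for illustration. LDA is tried only with one component because a two-class problem allows at most one discriminant axis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA, FactorAnalysis, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_reducers(k):
    """Reduction techniques named in the abstract, each keeping k components."""
    reducers = {
        "LPCA": PCA(n_components=k),
        "KPCA": KernelPCA(n_components=k, kernel="rbf"),  # kernel choice is an assumption
        "FA": FactorAnalysis(n_components=k, random_state=0),
    }
    if k == 1:
        # Binary labels give LDA at most n_classes - 1 = 1 component.
        reducers["LDA"] = LinearDiscriminantAnalysis(n_components=1)
    return reducers


def make_classifiers():
    """Classifiers named in the abstract; hyperparameters are assumptions."""
    return {
        "LSVM": SVC(kernel="linear"),
        "KSVM": SVC(kernel="rbf"),
        "DT": DecisionTreeClassifier(random_state=0),
        "RF": RandomForestClassifier(n_estimators=50, random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "NB": GaussianNB(),
    }


def ferco_sketch(X, y, max_components=5, test_size=0.25, seed=0):
    """Hypothetical helper: try every (k, reducer, classifier) combination.

    Returns (best, results), where each entry is
    (accuracy, k, reducer_name, classifier_name) measured on a hold-out split.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    results = []
    for k in range(1, max_components + 1):
        for r_name, reducer in make_reducers(k).items():
            for c_name, clf in make_classifiers().items():
                # Scale, reduce to k features, then classify.
                model = make_pipeline(StandardScaler(), reducer, clf)
                model.fit(X_tr, y_tr)
                results.append((model.score(X_te, y_te), k, r_name, c_name))
    # Highest accuracy first; among ties prefer the fewest components.
    best = min(results, key=lambda r: (-r[0], r[1]))
    return best, results


if __name__ == "__main__":
    # scikit-learn's built-in breast cancer data is the WDBC dataset.
    X, y = load_breast_cancer(return_X_y=True)
    best, _ = ferco_sketch(X, y, max_components=3)
    acc, k, r_name, c_name = best
    print(f"best: {r_name} + {c_name} with {k} component(s), accuracy {acc:.4f}")
```

A single hold-out split keeps the sketch short. The accuracy it reports depends on the split and should not be compared directly with the figures quoted in the abstract; cross-validation would give a more stable ranking of the combinations.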