Novel algorithm to extract multiple solutions for RNA sequence classification problem

2019 International Conference on High Performance Computing & Simulation (HPCS) Pub Date : 2019-07-01 DOI:10.1109/HPCS48598.2019.9188203

Naoual Guannoni, F. Mhamdi, Emanuel Weitschek, M. Elloumi



Abstract

Methods for extracting knowledge from Next Generation Sequencing (NGS) data are in high demand. NGS technology has led to an explosion in the amount of genomic data. However, the efficiency of NGS has made it a challenge to analyze this vast amount of genomic data and to carry out gene interaction and expression studies. In this work, we focus on RNA-seq gene expression analysis, and specifically on cancer studies with rule-based supervised classification algorithms that build a model able to discriminate tumoral from normal cases. State-of-the-art algorithms compute just a single classification model that contains few features. In contrast, our goal is to elicit more knowledge by computing many classification models, and thereby to identify most of the features related to an investigated class. Major efforts have been made in this field with rule-based algorithms (the CAMUR method), and an initial step has been taken with tree-based ones. In this paper, we propose a new method that extracts multiple equivalent classification models. This method integrates a rule-based classification method and a feature elimination technique in order to obtain more compact, exact, and interpretable models in a reduced execution time. We analyze an RNA-seq breast cancer data set extracted from The Cancer Genome Atlas (TCGA) and compare our results with the existing CAMUR method. Experimental results show the efficacy of the proposed method: we obtain several reliable and efficient classification models compared to CAMUR, and our method is also faster than the CAMUR algorithm.
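The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of combining a rule learner with feature elimination to obtain several alternative models. It is not the authors' method and not CAMUR. A single-gene threshold rule stands in for the rule-based classifier, and the gene names, expression values, class labels, and the `min_accuracy` cutoff are all hypothetical.

```python
def best_single_gene_rule(samples, labels, genes):
    """Return the most accurate rule 'gene > threshold => label' as a tuple
    (gene, threshold, label_if_above, training_accuracy), or None."""
    best = None
    for gene in genes:
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for above, below in (("tumor", "normal"), ("normal", "tumor")):
                hits = sum(
                    (above if s[gene] > threshold else below) == y
                    for s, y in zip(samples, labels)
                )
                accuracy = hits / len(samples)
                if best is None or accuracy > best[3]:
                    best = (gene, threshold, above, accuracy)
    return best


def extract_multiple_models(samples, labels, min_accuracy=0.9):
    """Repeatedly learn a rule, then eliminate the gene it used and relearn,
    stopping once no remaining gene yields an accurate enough rule."""
    remaining = sorted(samples[0])  # all gene names, in a fixed order
    models = []
    while remaining:
        rule = best_single_gene_rule(samples, labels, remaining)
        if rule is None or rule[3] < min_accuracy:
            break
        models.append(rule)
        remaining.remove(rule[0])  # feature elimination step
    return models


# Hypothetical toy data: four samples, three genes.
samples = [
    {"GENE_A": 9.1, "GENE_B": 1.0, "GENE_C": 5.0},
    {"GENE_A": 8.7, "GENE_B": 1.4, "GENE_C": 4.1},
    {"GENE_A": 2.0, "GENE_B": 6.3, "GENE_C": 4.8},
    {"GENE_A": 1.5, "GENE_B": 7.2, "GENE_C": 5.2},
]
labels = ["tumor", "tumor", "normal", "normal"]

models = extract_multiple_models(samples, labels)
for gene, threshold, above, accuracy in models:
    print(f"{gene} > {threshold:.2f} => {above} (training accuracy {accuracy:.2f})")
```

On this toy data the loop yields two alternative rules, one per discriminating gene, and stops at the third gene because no threshold on it reaches the accuracy cutoff. A realistic version would use multi-condition rules and measure accuracy on held-out data rather than on the training set.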
