The Role of Machine Learning in Finding Chimeric RNAs

2015 26th International Workshop on Database and Expert Systems Applications (DEXA). Pub Date: 2015-09-01. DOI: 10.1109/DEXA.2015.25

S. Beaumeunier, J. Audoux, A. Boureux, T. Commes, Nicolas Philippe, Ronnie Alves


Citations: 2

Abstract

High-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility that chRNAs expressed specifically in disease could serve as biomarkers for both diagnosis and prognosis. Discriminating true chRNAs from false ones poses an interesting Machine Learning (ML) challenge. First, the sequencing data may contain false reads due to technical artefacts, and during analysis bioinformatics tools may generate false positives due to methodological biases. Separating the real signal from the noise can therefore be a hard task. Furthermore, even with a proper set of observations (enough sequencing data) about true chRNAs, the resulting model may fail to generalize beyond them. As in any machine learning problem, the first major issue is finding good data (observations) with which to build the prediction model. Unfortunately, as far as we know, no common benchmark data set is available for chRNAs, and the related literature lacks a defined classification baseline. In this work we move towards a benchmark data set and a fair comparative analysis that clarifies the role of ML techniques in finding chRNAs. We have developed a benchmark pipeline that incorporates a genome mutation process and RNA-seq data simulated with Flux Simulator. The simulated sequencing reads were aligned and annotated by CRAC. CRAC offers a new way to analyze RNA-seq data by integrating genomic location and local coverage, allowing biological predictions in one step. The resulting data were used as a benchmark for our comparative analysis. We have observed that the no free lunch theorem does not hold for ensemble classifiers. Ensemble learning strategies proved more robust on this classification problem, providing an average AUC of 95% (ACC=94%, Kappa=0.87).
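The three scores quoted at the end of the abstract (AUC, ACC, Kappa) can be illustrated with a minimal sketch. The labels and scores below are toy values invented for illustration, not data from the paper, and the function names are hypothetical. The sketch assumes binary labels (1 = true chRNA, 0 = false positive) and a classifier that outputs a score per candidate. Note that Cohen's kappa is a unitless agreement coefficient with a maximum of 1, not a percentage.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of candidates whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    true_list, pred_list = list(y_true), list(y_pred)
    n = len(true_list)
    p_o = accuracy(true_list, pred_list)  # observed agreement
    # Expected agreement if labels were assigned independently
    # with the same marginal frequencies.
    p_e = sum(
        (true_list.count(c) / n) * (pred_list.count(c) / n)
        for c in set(true_list) | set(pred_list)
    )
    if p_e == 1.0:  # degenerate case: only one class present on both sides
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 4 true chRNAs, 4 false candidates.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5

print(f"ACC={accuracy(y_true, y_pred):.2f}")
print(f"Kappa={cohen_kappa(y_true, y_pred):.2f}")
print(f"AUC={roc_auc(y_true, scores):.4f}")
```

Accuracy and kappa depend on the chosen threshold, while AUC depends only on how the scores rank true candidates against false ones, which is why the three figures are usually reported together.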

