Mining sequential patterns from uncertain big DNA in the spark framework

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Pub Date : 2016-12-01 DOI:10.1109/BIBM.2016.7822641

Fan Jiang, C. Leung, O. Sarumi, Christine Y. Zhang


Citations: 20

Abstract

Big data has become ubiquitous: high volumes of a wide variety of valuable data of differing veracity (e.g., precise, imprecise, or uncertain data) are made available at high velocity by high-throughput machines and by techniques for data gathering and curation. Such data arise in many real-life applications in domains such as bioinformatics, biomedicine, finance, social networking, and weather forecasting. In bioinformatics, terabytes of deoxyribonucleic acid (DNA) sequences can now be generated within a few hours using next-generation sequencing (NGS) technologies such as Illumina HiSeq X and Illumina Genome Analyzer. Due to the nature of these NGS technologies, the generated data usually contain some noise or other forms of error. These uncertain data nevertheless carry a wealth of information in the form of frequent patterns. Mining frequently occurring patterns (e.g., motifs) from these big uncertain DNA sequences is a challenge in bioinformatics and biomedicine. Many existing algorithms are serial and mine DNA sequence motifs using methods designed for precise data. Mining motifs from big DNA sequences is computationally intensive because of the high volume and the associated uncertainty of these sequences. In this paper, we propose a scalable algorithm for high-performance computing in bioinformatics. Specifically, our parallel algorithm uses a fault-tolerant collection of resilient distributed datasets (RDDs) in the Apache Spark computing framework to mine sequence motifs from uncertain big DNA data. Experimental results show that our algorithm extracts accurate motifs within a short time frame.
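The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that each base call carries a probability of being correct, that errors are independent across positions, and that a motif is a fixed-length k-mer whose expected support (the sum, over all occurrences, of the product of its base probabilities) meets a threshold. The function names, the read representation, and the threshold are all hypothetical. Plain Python stands in for Spark here: the per-read map step and the pairwise merge would roughly correspond to a map over an RDD of reads followed by a reduce.

```python
from collections import Counter
from functools import reduce
from math import prod


def kmer_expected_support(read, k):
    """Map step: expected support of each k-mer in one uncertain read.

    `read` is a list of (base, probability) pairs, where probability is
    the assumed chance that the base call is correct (hypothetical format).
    """
    partial = Counter()
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        kmer = "".join(base for base, _ in window)
        # Assumes base-call errors are independent across positions.
        partial[kmer] += prod(p for _, p in window)
    return partial


def merge(left, right):
    """Reduce step: add two partial support tables together."""
    merged = Counter(left)
    merged.update(right)
    return merged


def mine_motifs(reads, k, min_expected_support):
    """Return k-mers whose total expected support meets the threshold."""
    partials = (kmer_expected_support(read, k) for read in reads)
    totals = reduce(merge, partials, Counter())
    return {kmer: s for kmer, s in totals.items() if s >= min_expected_support}


# Two toy reads with made-up base-call probabilities.
reads = [
    [("A", 0.9), ("C", 0.8), ("G", 1.0)],
    [("A", 1.0), ("C", 0.5), ("G", 0.9), ("T", 0.9)],
]
motifs = mine_motifs(reads, k=2, min_expected_support=1.0)
print(sorted(motifs))
```

Because each read is processed independently and the merge is associative, the work can be split across partitions in any order, which is the property a distributed framework relies on.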

