Processing next generation sequencing data in map-reduce framework using hadoop-BAM in a computer cluster

2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE) Pub Date : 2017-11-01 DOI:10.1109/ICITISEE.2017.8285542

Rifki Sadikin, Andria Arisal, Rofithah Omar, N. Mazni


Citations: 5

Abstract

Next-generation sequencing (NGS) in bioinformatics produces massive volumes of data, and big data technologies are needed to reduce the computation time of processing it. In this paper, we implement NGS data processing on the Hadoop Map-Reduce framework using the Hadoop-BAM library. Our implementation processes a Binary Alignment Map (BAM) file, which contains a reference sequence and many aligned and unaligned reads, by splitting the BAM file into Hadoop data blocks. To process the BAM file on a computer cluster, we implement a mapper and a reducer for the Hadoop Map-Reduce framework. The mapper processes the BAM file to produce key-value pairs, while the reducer summarizes the key-value pairs into a meaningful output. Here, the mapper and reducer are written to summarize the number of bases in a BAM file. We conduct the experiments on the LIPI Hadoop cluster, which consists of 96 CPU cores. The results show that our map-reduce implementation achieves a speed-up compared to serial processing of next-generation sequencing data with Picard tools.
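The mapper/reducer division described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation and does not use Hadoop or Hadoop-BAM: it assumes that "summarizing the number of bases" means a per-base count, it stands in for BAM splits with plain lists of read strings, and the names `map_read`, `reduce_counts`, and `count_bases` are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_read(read_sequence):
    """Mapper (illustrative): emit one (base, 1) pair per base in a read."""
    for base in read_sequence.upper():
        yield base, 1


def reduce_counts(pairs):
    """Reducer (illustrative): sum the values emitted for each key."""
    totals = Counter()
    for base, count in pairs:
        totals[base] += count
    return dict(totals)


def count_bases(splits):
    """Apply the mapper to every read in every split, then reduce.

    `splits` is a list of lists of read strings; it only mimics the way
    a BAM file would be divided into data blocks across a cluster.
    """
    mapped = chain.from_iterable(
        map_read(read) for split in splits for read in split
    )
    return reduce_counts(mapped)


if __name__ == "__main__":
    # Made-up reads grouped into two made-up splits.
    example_splits = [["ACGT", "AACC"], ["GGTN"]]
    print(count_bases(example_splits))
```

In a real Hadoop job the mapper calls would run in parallel on separate data blocks and the framework would group the pairs by key before the reducer sees them. The sketch performs both steps sequentially in one process to show only the data flow.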

