Rifki Sadikin, Andria Arisal, Rofithah Omar, N. Mazni
Processing next generation sequencing data in map-reduce framework using hadoop-BAM in a computer cluster
2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE), published 2017-11-01
DOI: 10.1109/ICITISEE.2017.8285542 (https://doi.org/10.1109/ICITISEE.2017.8285542)
Citation count: 5
Abstract
Next-Generation Sequencing (NGS) in bioinformatics produces massive volumes of data, and big data technologies are needed to reduce the computation time of processing it. In this paper, we use the Hadoop Map-Reduce framework to process NGS data with the Hadoop-BAM library. Our implementation processes a Binary Alignment Map (BAM) file, which contains a reference sequence and many aligned and unaligned reads, by splitting the BAM file into Hadoop data blocks. To process the BAM file on a computer cluster, we implement a mapper and a reducer for the Hadoop Map-Reduce framework. The mapper processes the BAM file to produce key-value pairs, and the reducer summarizes the key-value pairs into a meaningful output. Here, the mapper and reducer are written to summarize the number of bases in a BAM file. We conduct the experiments on the LIPI Hadoop cluster, which consists of 96 CPU cores. The results of our experiments show that our map-reduce implementation achieves a speed-up compared with serial processing of Next-Generation Sequencing data with Picard tools.
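
The mapper and reducer roles described in the abstract can be illustrated with a minimal in-memory sketch. It does not use Hadoop or the Hadoop-BAM API, and it is not the authors' implementation. It assumes that "summarizing the number of bases" means counting how often each base letter occurs, and the names `map_read`, `reduce_counts`, and `count_bases`, as well as the sample reads, are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_read(sequence):
    """Mapper: emit one (base, 1) key-value pair per base in a read."""
    for base in sequence.upper():
        yield base, 1


def reduce_counts(pairs):
    """Reducer: sum the values emitted for each base key."""
    totals = Counter()
    for base, count in pairs:
        totals[base] += count
    return dict(totals)


def count_bases(reads):
    """Run the map step over every read, then reduce all emitted pairs."""
    pairs = chain.from_iterable(map_read(read) for read in reads)
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Hypothetical read sequences standing in for the records of one BAM split.
    reads = ["ACGT", "AACN", "ggta"]
    print(count_bases(reads))
```

In a real cluster run, each mapper would receive the reads of one split of the BAM file and the framework would group the emitted pairs by key before the reduce step; here both steps run in a single process only to show the data flow.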