Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru
{"title":"Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses","authors":"Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru","doi":"10.1145/3107411.3107491","DOIUrl":null,"url":null,"abstract":"State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3107411.3107491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.