Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2017-08-20. DOI: 10.1145/3107411.3107491

Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru


Citations: 2

Abstract

State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute- and memory-intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. While partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high-quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.
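To make the idea of partitioning reads through a de Bruijn graph concrete, the following is a minimal serial sketch in Python. It is not the algorithm from the paper or the code in the ParBLiSS repository: it assumes the simplest possible partitioning criterion (connected components of the k-mer graph), ignores reverse complements and sequencing errors, and runs in a single process. The function names `kmers` and `partition_reads` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def partition_reads(reads, k):
    """Group read indices by connected component of a k-mer graph.

    Assumption (not from the paper): two reads belong to the same
    partition when their k-mers are linked through shared k-mers.
    Reads shorter than k are left out of the result.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Consecutive k-mers in a read overlap by k-1 bases, which is the
    # adjacency relation of a de Bruijn graph with k-mers as nodes.
    for read in reads:
        prev = None
        for km in kmers(read, k):
            find(km)
            if prev is not None:
                union(prev, km)
            prev = km

    # All k-mers of one read share a component, so the first k-mer
    # is enough to look up the partition of the whole read.
    groups = defaultdict(list)
    for idx, read in enumerate(reads):
        if len(read) >= k:
            groups[find(read[:k])].append(idx)
    return sorted(groups.values())
```

For example, `partition_reads(["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA"], 4)` places the first two reads together because they share the k-mer `GTAC`, and the last two together because they share `TTTT`. On real sequencing data a plain connected-components criterion is only a starting point, and a distributed-memory implementation would additionally need to spread both the k-mers and the union-find state across processes.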


