支持并行基因组学分析的高通量测序数据集的分布式内存分区

Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru
{"title":"支持并行基因组学分析的高通量测序数据集的分布式内存分区","authors":"Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru","doi":"10.1145/3107411.3107491","DOIUrl":null,"url":null,"abstract":"State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses\",\"authors\":\"Nagakishore Jammula, Sriram P. Chockalingam, S. Aluru\",\"doi\":\"10.1145/3107411.3107491\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.\",\"PeriodicalId\":246388,\"journal\":{\"name\":\"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3107411.3107491\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3107411.3107491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

最先进的高通量测序仪器破译超过十亿短基因组片段每运行。输出序列被称为“读取”。这些读取数据集有助于在基因组学、宏基因组学和转录组学等领域进行各种分析。由于读取数据集的规模很大,这种分析通常需要大量的计算和内存。在本文中,我们提出了一种用于划分大规模读数据集的并行算法,以促进分布式内存并行分析。在划分读数据集的过程中,我们并行地构造和划分相关联的de Bruijn图。这允许使用de Bruijn图的变体的应用程序,例如de novo assembly,直接利用生成的de Bruijn图分区。此外,我们提出了一种评估读取分区质量的机制,并证明了我们的算法产生了高质量的分区。我们的实现可以在github.com/ParBLiSS/read_partitioning上获得。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Distributed Memory Partitioning of High-Throughput Sequencing Datasets for Enabling Parallel Genomics Analyses
State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as 'reads'. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high quality partitions. Our implementation is available at github.com/ParBLiSS/read_partitioning.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信