Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2020-10-20. DOI: 10.1109/IPDPS49936.2021.00060

Giulia Guidi, Oguz Selvitopi, Marquita Ellis, L. Oliker, K. Yelick, A. Buluç


Citations: 9

Abstract

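One of the most computationally intensive tasks in computational biology is de novo genome assembly: decoding the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates a consensus. Although many algorithms have been developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2–1.3× for the human genome and 1.5–1.9× for C. elegans compared to the state of the art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5–13.3× for the human genome and 18–29× for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.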
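The two steps the abstract describes, overlap detection and transitive reduction expressed as sparse matrix products over semirings, can be illustrated with a small serial sketch. This is not the diBELLA 2D implementation: it runs on one process in pure Python, and the function names (`candidate_overlaps`, `transitive_reduction`), the `fuzz` tolerance, and the use of a single integer edge weight are assumptions made for illustration only. Overlap detection is shown as the nonzeros of C = A·Aᵀ, where A is a reads-by-k-mers indicator matrix; the reduction is shown as a (min, +) product of the overlap graph with itself, dropping an edge whenever a two-hop path explains it.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of k-mers occurring in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def candidate_overlaps(reads, k):
    """Count shared k-mers for every pair of reads.

    Equivalent to the nonzeros of C = A * A^T over the (+, *) semiring,
    where A[i, j] = 1 if read i contains k-mer j.
    """
    # Column view of A: for each k-mer, the reads that contain it.
    columns = defaultdict(list)
    for i, read in enumerate(reads):
        for kmer in kmers(read, k):
            columns[kmer].append(i)
    shared = defaultdict(int)
    for rows in columns.values():
        for i, j in combinations(sorted(rows), 2):
            shared[(i, j)] += 1
    return dict(shared)


def transitive_reduction(edges, fuzz=0):
    """Remove edges implied by a two-hop path, repeating until none remain.

    edges maps (u, w) to a weight (assumed here: an overhang length).
    An edge u->w is dropped when some v gives
    weight(u, v) + weight(v, w) <= weight(u, w) + fuzz.
    """
    edges = dict(edges)
    while True:
        out = defaultdict(dict)
        for (u, v), w in edges.items():
            out[u][v] = w
        # (min, +) product of the graph with itself.
        two_hop = {}
        for u, nbrs in out.items():
            for v, w_uv in nbrs.items():
                for x, w_vx in out.get(v, {}).items():
                    cand = w_uv + w_vx
                    if cand < two_hop.get((u, x), float("inf")):
                        two_hop[(u, x)] = cand
        transitive = [e for e, w in edges.items()
                      if e in two_hop and two_hop[e] <= w + fuzz]
        if not transitive:
            return edges
        for e in transitive:
            del edges[e]


overlap_graph = {(0, 1): 2, (1, 2): 3, (0, 2): 5}
print(transitive_reduction(overlap_graph))  # {(0, 1): 2, (1, 2): 3}
```

In a distributed setting both products would be computed on 2D-partitioned sparse matrices, as the abstract states; the dictionaries above only stand in for those matrices to keep the sketch self-contained.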



