High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

Yifan Li, Giulia Guidi
arXiv - QuanBio - Genomics, published 2024-07-10
DOI: arxiv-2407.07718 (https://doi.org/arxiv-2407.07718)

Abstract

In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of fixed-length k DNA subsequences, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. They often also lack support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, proving its flexibility and practicality in real-world scenarios.
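To make the idea of sorting-based counting concrete, the following is a minimal single-process sketch in Python. It is not HySortK's implementation: the function names `extract_kmers` and `count_kmers_by_sorting` are hypothetical, the choice to skip k-mers containing non-ACGT characters is an assumption, and the distributed communication scheme and hybrid task layer described in the abstract are not modeled. It only shows how counts can be obtained by sorting k-mers and measuring runs of equal neighbors instead of updating a hash table.

```python
from itertools import groupby


def extract_kmers(seq: str, k: int) -> list[str]:
    """Return every length-k substring of seq made only of A, C, G, T.

    Skipping windows with other characters (e.g. N) is an assumption
    of this sketch, not something stated in the abstract.
    """
    seq = seq.upper()
    valid = set("ACGT")
    return [
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    ]


def count_kmers_by_sorting(reads: list[str], k: int) -> list[tuple[str, int]]:
    """Count k-mers by sorting them and measuring runs of equal items."""
    kmers: list[str] = []
    for read in reads:
        kmers.extend(extract_kmers(read, k))

    # After sorting, identical k-mers are adjacent, so a single
    # sequential pass is enough to count each distinct k-mer.
    kmers.sort()
    return [(kmer, sum(1 for _ in run)) for kmer, run in groupby(kmers)]


if __name__ == "__main__":
    for kmer, count in count_kmers_by_sorting(["ACGTAC", "CGTACG"], 3):
        print(kmer, count)
```

The counting pass reads the sorted array sequentially, which is the kind of access pattern that tends to be more cache friendly than the scattered lookups of a hash table.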