High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

arXiv - QuanBio - Genomics Pub Date : 2024-07-10 DOI:arxiv-2407.07718

Yifan Li, Giulia Guidi

Abstract

In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of DNA subsequences of fixed length k, is a fundamental step in various bioinformatics pipelines, including genome assembly and protein prediction. Due to the growing volume of data, the scaling of the counting process is critical. In the literature, distributed-memory software uses hash tables, which exhibit poor cache friendliness and consume excessive memory. Such software often also lacks support for flexible parallelism, which makes integration into existing bioinformatics pipelines difficult. In this work, we propose HySortK, a highly efficient sorting-based distributed-memory k-mer counter. HySortK reduces the communication volume through a carefully designed communication scheme and domain-specific optimization strategies. Furthermore, we introduce an abstract task layer for flexible hybrid parallelism to address load imbalances in different scenarios. HySortK achieves a 2-10x speedup compared to the GPU baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes. Finally, we integrated HySortK into an existing genome assembly pipeline and achieved up to 1.8x speedup, demonstrating its flexibility and practicality in real-world scenarios.
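To make the idea of sorting-based counting concrete, the following is a minimal single-process sketch, not HySortK's implementation. It assumes reads are plain strings over A, C, G, T, skips windows containing any other character, and ignores reverse complements. The function names `extract_kmers` and `count_kmers_by_sorting` are hypothetical and chosen only for illustration. Instead of inserting each k-mer into a hash table, it sorts all k-mers so that equal ones become adjacent and then counts the length of each run.

```python
from itertools import groupby


def extract_kmers(read: str, k: int) -> list[str]:
    """Return every length-k window of `read` made only of A, C, G, T."""
    valid = set("ACGT")
    return [
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if set(read[i:i + k]) <= valid
    ]


def count_kmers_by_sorting(reads: list[str], k: int) -> list[tuple[str, int]]:
    """Count k-mers by sorting them and measuring runs of equal neighbours."""
    kmers: list[str] = []
    for read in reads:
        kmers.extend(extract_kmers(read, k))

    # After sorting, identical k-mers sit next to each other,
    # so one linear pass is enough to count them.
    kmers.sort()
    return [(kmer, sum(1 for _ in run)) for kmer, run in groupby(kmers)]


if __name__ == "__main__":
    for kmer, count in count_kmers_by_sorting(["ACGTAC", "GTACGT"], k=3):
        print(kmer, count)
```

The sort followed by a linear scan reads memory sequentially, which is the cache-friendliness argument the abstract makes against hash tables. A distributed version would additionally have to partition k-mers across processes before sorting, which this sketch does not attempt.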

