High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism
Authors: Yifan Li, Giulia Guidi
Published in: arXiv - QuanBio - Genomics, 2024-07-10
DOI: https://doi.org/arxiv-2407.07718
Citations: 0
Abstract
High-throughput sequencing technologies generate large quantities of DNA data
and require advanced bioinformatics infrastructure for efficient data
analysis. k-mer counting, the process of counting how often each DNA
subsequence of fixed length k occurs, is a fundamental step in various
bioinformatics pipelines, including genome assembly and protein prediction. As
data volumes grow, scaling the counting process is critical.
In the literature, distributed memory software uses hash tables, which exhibit
poor cache friendliness and consume excessive memory. They often also lack
support for flexible parallelism, which makes integration into existing
bioinformatics pipelines difficult. In this work, we propose HySortK, a highly
efficient sorting-based distributed memory k-mer counter. HySortK reduces the
communication volume through a carefully designed communication scheme and
domain-specific optimization strategies. Furthermore, we introduce an abstract
task layer for flexible hybrid parallelism to address load imbalances in
different scenarios. HySortK achieves a 2-10x speedup over the GPU
baseline on 4 and 8 nodes. Compared to state-of-the-art CPU software, HySortK
achieves up to 2x speedup while reducing peak memory usage by 30% on 16 nodes.
Finally, we integrated HySortK into an existing genome assembly pipeline and
achieved up to 1.8x speedup, demonstrating its flexibility and practicality in
real-world scenarios.
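
To make the idea of sorting-based counting concrete, here is a minimal single-process sketch in Python. It is not HySortK's implementation: the function names, the 2-bit base encoding, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration, and it does not cover the distributed communication, the task layer, or canonical (reverse-complement) k-mers. It only shows the core contrast with a hash table: collect every k-mer as an integer, sort the list, then count runs of equal values.

```python
from itertools import groupby

# Hypothetical 2-bit encoding; an assumption for this sketch only.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(read, k):
    """Return every k-mer of `read` packed into an integer, 2 bits per base.

    Windows that contain a character outside ACGT (e.g. N) are skipped.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # number of consecutive valid bases seen so far
    out = []
    for base in read:
        code = ENCODE.get(base)
        if code is None:
            # Reset the rolling window at an ambiguous base.
            value = 0
            valid = 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            out.append(value)
    return out


def decode_kmer(value, k):
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))


def count_kmers_by_sorting(reads, k):
    """Count k-mers by sorting the packed values and counting equal runs."""
    kmers = []
    for read in reads:
        kmers.extend(encode_kmers(read, k))
    kmers.sort()  # equal k-mers become adjacent after sorting
    return [(decode_kmer(v, k), sum(1 for _ in run)) for v, run in groupby(kmers)]


if __name__ == "__main__":
    for kmer, count in count_kmers_by_sorting(["ACGTACGT", "CGTAN"], 3):
        print(kmer, count)
```

Because the counting pass scans a sorted array sequentially, it avoids the random memory accesses of hash-table insertion, which is the cache-friendliness argument made above. In a distributed setting each process would presumably sort only the k-mers it owns after an exchange step, but that part is outside this sketch.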