Efficient downsampling of genome alignments with Rasusa.

IF 1.2

GigaByte (Hong Kong, China) Pub Date : 2026-04-27 eCollection Date: 2026-01-01 DOI:10.46471/gigabyte.180

Achmad Dimas Cahyaning Furqon, Leah W Roberts, Michael B Hall

Abstract

High-throughput sequencing datasets frequently exhibit extreme variation in read depth, which biases downstream analysis. Normalising coverage to a specific depth cap is therefore important, yet existing tools rely on either computationally expensive fetch-based algorithms or non-deterministic greedy algorithms. Here, we present a new coordinate-sorted sweep-line algorithm, implemented in the open-source software rasusa, that enforces a strict coverage cap at every genomic position. By assigning seeded random priorities to reads, we achieve unbiased, reproducible read selection. The algorithm reduces runtimes by over 1,400-fold compared to legacy fetch-based methods (cutting processing from hours to seconds) and runs roughly four times faster than VariantBam. Furthermore, it requires only 8 MB of memory for long-read data. This provides a highly efficient, scalable, and reproducible solution for sequencing coverage normalisation.
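
To make the idea concrete, the following is a minimal Python sketch of a sweep over coordinate-sorted intervals with seeded random priorities and a hard depth cap. It is an illustration under stated assumptions, not the rasusa implementation: the function names `downsample` and `max_depth`, the `(name, start, end)` tuple layout, the half-open interval convention, and the rule of evicting the lowest-priority active read on overflow are all hypothetical choices made for this example.

```python
import random


def downsample(reads, cap, seed=0):
    """Keep a subset of reads so that depth never exceeds `cap`.

    Assumption: `reads` is a list of (name, start, end) tuples with
    half-open intervals [start, end), already sorted by start.
    """
    rng = random.Random(seed)  # seeded, so the same input gives the same output
    active = {}  # read index -> (end, priority) for reads spanning the sweep position
    kept = []    # indices of reads that left the sweep without being evicted

    for idx, (_name, start, end) in enumerate(reads):
        # Retire reads that end at or before the current start; they can
        # no longer be evicted, so their selection is final.
        for j in [j for j, (e, _p) in active.items() if e <= start]:
            kept.append(j)
            del active[j]

        # Give the incoming read a random priority, then evict the
        # lowest-priority active read if the cap is exceeded.
        active[idx] = (end, rng.random())
        if len(active) > cap:
            loser = min(active, key=lambda j: active[j][1])
            del active[loser]

    kept.extend(active)  # reads still active at the end of the input are kept
    return [reads[i] for i in sorted(kept)]


def max_depth(reads):
    """Maximum number of reads overlapping any single position."""
    # At equal positions, end events (-1) sort before start events (+1),
    # which matches the half-open interval convention.
    events = sorted([(s, 1) for _, s, _ in reads] + [(e, -1) for _, _, e in reads])
    depth = best = 0
    for _pos, delta in events:
        depth += delta
        best = max(best, depth)
    return best
```

Calling `downsample(reads, cap=10, seed=42)` twice on the same sorted input returns the same reads, and `max_depth` of the result does not exceed the cap. The sketch scans the active set linearly, which costs time proportional to the cap per read; a production implementation would more likely use heaps keyed on end coordinate and priority.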
