Efficient downsampling of genome alignments with Rasusa.

Impact Factor: 1.2
GigaByte (Hong Kong, China) · Pub Date: 2026-04-27 · eCollection Date: 2026-01-01 · DOI: 10.46471/gigabyte.180
Achmad Dimas Cahyaning Furqon, Leah W Roberts, Michael B Hall
Citations: 0

Abstract

High-throughput sequencing datasets frequently exhibit extreme read depth variation, biasing downstream analysis. Normalising coverage to a specific depth cap is important, yet existing tools rely on computationally expensive fetch-based or non-deterministic greedy algorithms. Here, we present a new coordinate-sorted sweep-line algorithm, implemented in the open-source software rasusa, that enforces a strict coverage cap at every genomic position. By utilising seeded random priority assignment, we achieve unbiased, reproducible read selection. The algorithm reduces runtimes by over 1,400-fold compared to legacy fetch-based methods (cutting processing from hours to seconds) and operates roughly four times faster than VariantBam. Furthermore, it requires only 8 MB of memory for long-read data. This provides a highly efficient, scalable, and reproducible solution for sequencing coverage normalisation.
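The abstract describes a coordinate-sorted sweep-line with seeded random priorities but does not reproduce the algorithm itself. As a minimal sketch of the general idea (not rasusa's actual implementation, which is written in Rust), one can walk reads in start-coordinate order, track the end positions of already-kept reads in a min-heap, and keep a new read only while fewer than `cap` kept reads still overlap its start; a seeded RNG breaks ties among reads sharing a start position so the selection is reproducible. The function name `downsample` and the greedy keep/drop rule here are illustrative assumptions:

```python
import heapq
import random

def downsample(reads, cap, seed=42):
    """Greedy sweep-line depth capping over (start, end) intervals.

    A read is kept only if, at its start coordinate, fewer than `cap`
    previously kept reads still overlap it. A seeded RNG assigns each
    read a random tie-break priority among equal start positions, so
    the same seed always yields the same selection.
    """
    rng = random.Random(seed)
    # Sort by start; the random key value only reorders reads that
    # share a start coordinate, reproducibly for a given seed.
    prioritised = sorted(reads, key=lambda r: (r[0], rng.random()))

    kept = []
    active_ends = []  # min-heap of end positions of kept, still-active reads
    for start, end in prioritised:
        # Retire kept reads that end at or before this read's start.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        if len(active_ends) < cap:  # coverage at `start` is below the cap
            kept.append((start, end))
            heapq.heappush(active_ends, end)
    return kept
```

Because each read is pushed and popped at most once, the sweep runs in O(n log n) after sorting, which is consistent with the single-pass, low-memory behaviour the abstract reports.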

Source journal: GigaByte (Hong Kong, China) · CiteScore: 2.60 · Self-citation rate: 0.00% · Review time: 5 weeks