OReO:为实际压缩优化读顺序。

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Bioinformatics advances Pub Date : 2025-06-03 eCollection Date: 2025-01-01 DOI:10.1093/bioadv/vbaf128
Mathilde Girard, Léa Vandamme, Bastien Cazaux, Antoine Limasset
{"title":"OReO:为实际压缩优化读顺序。","authors":"Mathilde Girard, Léa Vandamme, Bastien Cazaux, Antoine Limasset","doi":"10.1093/bioadv/vbaf128","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and managing the rapidly growing volume of read datasets. Although more than 50 specialized compression tools have been developed, employing methods such as reference-based approaches, customized generic compressors, and read reordering, many users still rely on common generic compressors (e.g. gzip, zstd, xz) for convenience, portability, and reliability, despite their low compression ratios. Here, we introduce Optimizing Read Order (OReO), a simple read-reordering framework that achieves high compression performance without requiring specialized software for decompression. By grouping overlapping reads together before applying generic compressors, OReO exploits inherent redundancies in sequencing data and achieves compression ratios on par with state-of-the-art tools. Moreover, because it relies only on standard decompressors, OReO avoids the need for dedicated installations and maintenance, removing a key barrier to practical adoption.</p><p><strong>Results: </strong>We evaluated OReO on both Oxford Nanopore Technologies (ONT) and HiFi genomic and metagenomic datasets of varying sizes and complexities. Our results demonstrate that OReO provides substantial compression gains with comparable resource usage and outperforms dedicated methods in decompression speed. We propose that future compression strategies should focus on reordering as a means to let generic compression tools fully exploit data redundancy, offering an efficient, sustainable, and user-friendly solution to the growing challenges of sequencing data storage.</p><p><strong>Availability and implementation: </strong>The OReO code is open source and available at github.com/girunivlille/oreo.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf128"},"PeriodicalIF":2.4000,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12185860/pdf/","citationCount":"0","resultStr":"{\"title\":\"OReO: optimizing read order for practical compression.\",\"authors\":\"Mathilde Girard, Léa Vandamme, Bastien Cazaux, Antoine Limasset\",\"doi\":\"10.1093/bioadv/vbaf128\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and managing the rapidly growing volume of read datasets. Although more than 50 specialized compression tools have been developed, employing methods such as reference-based approaches, customized generic compressors, and read reordering, many users still rely on common generic compressors (e.g. gzip, zstd, xz) for convenience, portability, and reliability, despite their low compression ratios. Here, we introduce Optimizing Read Order (OReO), a simple read-reordering framework that achieves high compression performance without requiring specialized software for decompression. By grouping overlapping reads together before applying generic compressors, OReO exploits inherent redundancies in sequencing data and achieves compression ratios on par with state-of-the-art tools. Moreover, because it relies only on standard decompressors, OReO avoids the need for dedicated installations and maintenance, removing a key barrier to practical adoption.</p><p><strong>Results: </strong>We evaluated OReO on both Oxford Nanopore Technologies (ONT) and HiFi genomic and metagenomic datasets of varying sizes and complexities. Our results demonstrate that OReO provides substantial compression gains with comparable resource usage and outperforms dedicated methods in decompression speed. We propose that future compression strategies should focus on reordering as a means to let generic compression tools fully exploit data redundancy, offering an efficient, sustainable, and user-friendly solution to the growing challenges of sequencing data storage.</p><p><strong>Availability and implementation: </strong>The OReO code is open source and available at github.com/girunivlille/oreo.</p>\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":\"5 1\",\"pages\":\"vbaf128\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-06-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12185860/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbaf128\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbaf128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

动机:高通量和第三代测序技术的最新进展为存储和管理快速增长的读取数据集带来了重大挑战。尽管已经开发了50多种专门的压缩工具,采用了诸如基于引用的方法、定制的通用压缩器和读取重排序等方法,但许多用户仍然依赖通用压缩器(例如gzip、zstd、xz)来获得便利性、可移植性和可靠性,尽管它们的压缩比很低。在这里,我们介绍优化读顺序(OReO),这是一个简单的读重排序框架,无需专门的软件即可实现高压缩性能。通过在应用通用压缩器之前将重叠的读取分组在一起,OReO利用了测序数据中的固有冗余,并实现了与最先进工具相当的压缩比。此外,由于它只依赖于标准的减压器,OReO避免了专用安装和维护的需要,消除了实际采用的关键障碍。结果:我们在不同大小和复杂性的Oxford Nanopore Technologies (ONT)和HiFi基因组和宏基因组数据集上评估了OReO。我们的结果表明,OReO在相当的资源使用情况下提供了可观的压缩收益,并且在解压速度上优于专用方法。我们建议未来的压缩策略应侧重于重新排序,以使通用压缩工具充分利用数据冗余,为测序数据存储日益增长的挑战提供高效、可持续和用户友好的解决方案。可用性和实现:OReO代码是开源的,可以在github.com/girunivlille/oreo上获得。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
OReO: optimizing read order for practical compression.

Motivation: Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and managing the rapidly growing volume of read datasets. Although more than 50 specialized compression tools have been developed, employing methods such as reference-based approaches, customized generic compressors, and read reordering, many users still rely on common generic compressors (e.g. gzip, zstd, xz) for convenience, portability, and reliability, despite their low compression ratios. Here, we introduce Optimizing Read Order (OReO), a simple read-reordering framework that achieves high compression performance without requiring specialized software for decompression. By grouping overlapping reads together before applying generic compressors, OReO exploits inherent redundancies in sequencing data and achieves compression ratios on par with state-of-the-art tools. Moreover, because it relies only on standard decompressors, OReO avoids the need for dedicated installations and maintenance, removing a key barrier to practical adoption.

Results: We evaluated OReO on both Oxford Nanopore Technologies (ONT) and HiFi genomic and metagenomic datasets of varying sizes and complexities. Our results demonstrate that OReO provides substantial compression gains with comparable resource usage and outperforms dedicated methods in decompression speed. We propose that future compression strategies should focus on reordering as a means to let generic compression tools fully exploit data redundancy, offering an efficient, sustainable, and user-friendly solution to the growing challenges of sequencing data storage.

Availability and implementation: The OReO code is open source and available at github.com/girunivlille/oreo.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
1.60
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信