AliCo: A New Efficient Representation for SAM Files

Idoia Ochoa, Hongyi Li, Florian Baumgarte, C. Hergenrother, Jan Voges, M. Hernaez
Published in: 2019 Data Compression Conference (DCC), 2019-03-26
DOI: 10.1109/DCC.2019.00017 (https://doi.org/10.1109/DCC.2019.00017)
Citations: 2

Abstract

As genome sequencing continues to become more affordable, more raw and aligned genomic files are expected to be generated in the coming years. In addition, because the throughput of sequencing machines keeps increasing, the size of these files is growing significantly. Aligned files in particular (e.g., SAM/BAM) are used for further processing of the data, so an efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to aligned data represented in the SAM format. We demonstrate through simulations on existing datasets that, on average, AliCo outperforms the state-of-the-art compressors for SAM files in compression ratio, achieving more than 85% reduction in size when operating in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including, for the first time, the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo further supports optional compression of the reference sequence used for compression, hence guaranteeing exact reconstruction of the compressed data. Finally, AliCo allows the data to be streamed as it is being compressed and decompressed as it is being received, potentially providing significant time savings. AliCo is available at: https://github.com/iochoa/alico
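
To make the idea of position-dependent quantization of quality scores concrete, the following is a minimal Python sketch. It is not the CALQ or AliCo algorithm: the function names `quantize_quality` and `quantize_read`, the assumed Phred range of 0 to 41, and the use of uniform bins are all illustrative assumptions. In particular, the number of levels per position is simply passed in here, whereas the abstract states that CALQ derives it from the aligned reads.

```python
def quantize_quality(q, levels, q_min=0, q_max=41):
    """Map one Phred score to the representative of its uniform bin.

    Hypothetical helper: the range [q_min, q_max] and the uniform binning
    are assumptions for illustration, not taken from AliCo or CALQ.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    q = min(max(q, q_min), q_max)              # clamp into the assumed range
    width = (q_max - q_min + 1) / levels       # bin width (may be fractional)
    index = min(int((q - q_min) / width), levels - 1)
    return q_min + int(index * width + width / 2)  # bin midpoint, truncated


def quantize_read(qualities, levels_per_position):
    """Quantize each score of a read with its own number of levels.

    Fewer levels at a position means coarser (more lossy) quantization;
    as many levels as distinct scores leaves the score unchanged.
    """
    if len(qualities) != len(levels_per_position):
        raise ValueError("one level count is needed per quality score")
    return [quantize_quality(q, k) for q, k in zip(qualities, levels_per_position)]


if __name__ == "__main__":
    # Coarse quantization (2 levels) at the first two positions,
    # effectively lossless (42 levels) at the last two.
    print(quantize_read([10, 30, 41, 5], [2, 2, 42, 42]))
```

With 42 levels over the assumed range each score maps to itself, so the same routine covers both the lossy and the lossless end of the trade-off; a coarser setting reduces the number of distinct symbols an entropy coder has to represent.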