Reference-based data compression for genome in cloud

Haixiang Shi, Yongqing Zhu, J. Samsudin
Proceedings of the 2nd International Conference on Communication and Information Processing, published 2016-11-26
DOI: 10.1145/3018009.3018030 (https://doi.org/10.1145/3018009.3018030)
Citations: 3

Abstract

In this paper, we propose a new reference-based data compression method for efficiently compressing genome sequencing data in FASTQ format. With advances in next-generation sequencing technology, genome data can be generated faster and more cheaply, which makes efficient storage of these data in cloud computing a challenge. To store this kind of genome data efficiently in the cloud, content-aware compression methods have to be developed that make use of the specific file structure. Compared with existing genome-specific compression methods, our proposed content-aware method focuses on a high compression ratio by taking advantage of the repetitive nature of DNA sequences and by using reference genomes to compress the sequences inside FASTQ files. Benchmark results on 8 datasets show that our method can achieve the highest compression ratio compared with existing FASTQ file compressors.
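The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea behind reference-based compression of a read, not the authors' method. It assumes substitutions only (no insertions or deletions) and uses a brute-force search over the reference where a real tool would use an index. The names `encode_read`, `decode_read`, and `max_mismatches` are hypothetical and chosen for illustration.

```python
from typing import List, Optional, Tuple

Mismatch = Tuple[int, str]                 # (offset within the read, base in the read)
Encoded = Tuple[int, int, List[Mismatch]]  # (reference position, read length, mismatches)


def encode_read(read: str, reference: str, max_mismatches: int = 2) -> Optional[Encoded]:
    """Encode a read as a position in the reference plus a list of substitutions.

    Returns None when no window of the reference matches the read with at most
    max_mismatches substitutions; such a read would have to be stored verbatim.
    """
    best: Optional[Encoded] = None
    n = len(read)
    # Brute-force scan of every window; illustrative only, not efficient.
    for pos in range(len(reference) - n + 1):
        window = reference[pos:pos + n]
        diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
        if len(diffs) <= max_mismatches and (best is None or len(diffs) < len(best[2])):
            best = (pos, n, diffs)
            if not diffs:
                break  # an exact match cannot be improved
    return best


def decode_read(encoded: Encoded, reference: str) -> str:
    """Rebuild the original read from the reference and the stored substitutions."""
    pos, n, diffs = encoded
    bases = list(reference[pos:pos + n])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"   # toy reference, not real genome data
    read = "CGTAAGCCGA"                    # reference[5:15] with one substitution
    encoded = encode_read(read, reference)
    print(encoded)
    if encoded is not None:
        print(decode_read(encoded, reference) == read)
```

The gain comes from storing a position, a length, and a few substitutions instead of the full base string. The identifier and quality lines of each FASTQ record are not covered by this sketch and would need their own encoding.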