Referential genome sequence compression with low memory consumption

International Conference on Signal Processing Systems, Pub Date: 2022-05-01, DOI: 10.1117/12.2631583

Zhiwen Lu, Jianhua Chen, Rongshu Wang


Citations: 0

Abstract


The rapid development of genome sequencing technology has produced a large amount of genome data, which in turn raises the problem of storing it. Compression of genome data has therefore become a research hotspot. We propose a new genome data compression algorithm for FASTA-format sequences, called LCMRGC (low memory consumption referential genome compressor). The algorithm uses a suffix array to support the search for matching strings and uses binary search to accelerate exact matching, so as to obtain a better compression ratio. Experimental results on standard genome data show that the proposed algorithm significantly reduces the memory required to run the program and is competitive in compression ratio and compression time.
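The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it names: a suffix array over the reference sequence, searched by binary search to find the longest exact match for the current position in the target. It is not the LCMRGC implementation. The function names, the `("copy", pos, length)` / `("lit", base)` operation format, and the `min_len` threshold are all assumptions made for illustration, and the naive suffix array construction used here is not memory-efficient.

```python
def build_suffix_array(ref):
    # Naive construction (sorts full suffixes); acceptable only for short
    # illustrative inputs, not for real genomes.
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def common_prefix_len(a, i, b, j):
    # Length of the common prefix of a[i:] and b[j:].
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def longest_match(ref, sa, target, start):
    """Return (ref_pos, length) of the longest exact match of target[start:] in ref."""
    query = target[start:]
    m = len(query)
    # Binary search for the first suffix that is >= query in sorted order.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    # The suffix sharing the longest prefix with the query is one of the
    # two neighbours of the insertion point.
    best = (0, 0)
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            length = common_prefix_len(ref, sa[k], target, start)
            if length > best[1]:
                best = (sa[k], length)
    return best


def encode(ref, target, min_len=4):
    # min_len is a hypothetical threshold: shorter matches become literals.
    sa = build_suffix_array(ref)
    ops, i = [], 0
    while i < len(target):
        pos, length = longest_match(ref, sa, target, i)
        if length >= min_len:
            ops.append(("copy", pos, length))
            i += length
        else:
            ops.append(("lit", target[i]))
            i += 1
    return ops


def decode(ref, ops):
    # Rebuild the target from the reference and the operation list.
    out = []
    for op in ops:
        if op[0] == "copy":
            _, pos, length = op
            out.append(ref[pos:pos + length])
        else:
            out.append(op[1])
    return "".join(out)
```

In this sketch, `decode(ref, encode(ref, target))` reproduces `target`, and the compression gain comes from replacing long repeated stretches with short `(pos, length)` references into the reference genome. A practical compressor would additionally entropy-code the operations and use a linear-time, memory-bounded suffix array construction.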

