Study on reference-based FASTQ genome sequences compression
Wenlong Li, Jianhua Chen, Zhiwen Lu
2022 2nd International Conference on Bioinformatics and Intelligent Computing, 2022-01-21
DOI: 10.1145/3523286.3524511 (https://doi.org/10.1145/3523286.3524511)
Citation count: 0
Abstract
As the cost of genome sequencing falls, the volume of genomic data being generated creates a substantial storage problem, and specialized compression of FASTQ files remains an open area of work. This paper explores a reference-based lossless compression algorithm for genome sequences in FASTQ format. We propose a compression scheme based on longest matching, using an FMD-index to support exact-match searching. The scheme also uses the reverse-complement sequence and encodes insertion, deletion, and replacement operations compactly to further improve the compression ratio. In a comparison with the experimental results of five compressors on seven genome data sets, the proposed algorithm significantly improves FASTQ file compression ratios and is competitive in running time.
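The idea of longest-match, reference-based encoding with reverse-complement support can be illustrated with a minimal Python sketch. This is not the authors' implementation: a naive `str.find` search stands in for the FMD-index, unmatched bases are stored as single literals rather than as explicit insertion, deletion, or replacement operations, and all function names, the token layout, and the `min_match` threshold are assumptions made for illustration only.

```python
# Illustrative sketch only. Naive substring search replaces the FMD-index,
# and the token format ("M" = match, "L" = literal) is hypothetical.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_match(reference, read, start):
    """Return (ref_pos, length) of the longest prefix of read[start:]
    that occurs exactly in the reference; (-1, 0) if even one base fails."""
    best_pos, best_len = -1, 0
    length = 1
    while start + length <= len(read):
        pos = reference.find(read[start:start + length])
        if pos < 0:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def encode_read(reference, read, min_match=4):
    """Greedily split a read into exact matches and literal bases."""
    tokens = []
    i = 0
    while i < len(read):
        pos, length = longest_match(reference, read, i)
        if length >= min_match:
            tokens.append(("M", pos, length))  # copy from the reference
            i += length
        else:
            tokens.append(("L", read[i]))      # store the base verbatim
            i += 1
    return tokens


def encode_with_strand(reference, read, min_match=4):
    """Encode the read and its reverse complement; keep the shorter token list."""
    fwd = encode_read(reference, read, min_match)
    rev = encode_read(reference, reverse_complement(read), min_match)
    if len(rev) < len(fwd):
        return "-", rev
    return "+", fwd


def decode(reference, strand, tokens):
    """Rebuild the original read from the strand flag and its tokens."""
    parts = []
    for tok in tokens:
        if tok[0] == "M":
            _, pos, length = tok
            parts.append(reference[pos:pos + length])
        else:
            parts.append(tok[1])
    seq = "".join(parts)
    return reverse_complement(seq) if strand == "-" else seq
```

In this sketch a read that matches the reference exactly collapses to a single position and length pair, and a read sequenced from the opposite strand does the same once the reverse complement is tried. A real compressor would additionally entropy-code the tokens and handle the FASTQ identifier and quality lines, which are outside the scope of this example.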