A Compression Algorithm of Fastq File Based on Distribution Characteristics Analysis
Shengyu Lu, Hanping Chen, Lifa Peng, Beizhan Wang, Hongji Wang, Xiuze Zhou
2018 13th International Conference on Computer Science & Education (ICCSE), 2018-08-01
DOI: 10.1109/ICCSE.2018.8468742 (https://doi.org/10.1109/ICCSE.2018.8468742)
As sequencing technology continues to develop, the cost of DNA sequencing is gradually falling, and the volume of DNA sequencing data is increasing substantially. Genome data must be stored, but traditional computer rooms do not have enough capacity for data of this size, so more and more genome data needs to be uploaded to the cloud. Because genomic data is growing much faster than communication bandwidth, genome data compression is particularly important for reducing the costs of scientific research institutions and for speeding up the sharing of genomic data. Fastq is an important format for genomic data, and the main compression algorithms for fastq files currently include DSRC and FQC, among others. These algorithms also compress based on the characteristics of fastq files. To improve the compression rate, we propose an algorithm, DDSRC, which establishes statistical models of the distribution characteristics of strings in fastq files in order to compress them more efficiently. This paper explains the algorithm based on distribution characteristics analysis and compares its results with those of other compression algorithms.
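
The abstract does not describe DDSRC's statistical models, so the following is only a minimal sketch of what analyzing the distribution characteristics of fastq strings could look like, not the authors' method. It assumes the common four-line fastq record layout (title, sequence, `+` separator, quality), and the helper names `split_streams`, `symbol_distribution`, and `entropy_bits` are hypothetical. The sketch splits a file into its three informative streams, estimates each stream's symbol distribution, and reports the zero-order Shannon entropy as a rough lower bound on bits per symbol for a memoryless coder.

```python
import math
from collections import Counter


def split_streams(fastq_text):
    """Split fastq text into title, sequence, and quality streams.

    Assumes exactly four lines per record (an assumption of this sketch;
    multi-line records are not handled).
    """
    lines = fastq_text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per fastq record")
    return {
        "title": lines[0::4],
        "sequence": lines[1::4],
        "quality": lines[3::4],
    }


def symbol_distribution(strings):
    """Return the relative frequency of each character across the strings."""
    counts = Counter("".join(strings))
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


def entropy_bits(distribution):
    """Zero-order Shannon entropy of a distribution, in bits per symbol."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Two made-up records, used purely for illustration.
sample = "\n".join([
    "@read1", "ACGTACGT", "+", "IIIIHHHH",
    "@read2", "AACCGGTT", "+", "IIIIIIII",
])

streams = split_streams(sample)
for name, strings in streams.items():
    dist = symbol_distribution(strings)
    print(f"{name}: {entropy_bits(dist):.3f} bits/symbol")
```

Treating the streams separately reflects the general observation that titles, bases, and quality scores have very different distributions, so a model fitted to one stream is a poor fit for the others. A practical compressor would use richer, context-dependent models than the per-character frequencies shown here.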