GDedup: Distributed File System Level Deduplication for Genomic Big Data

Paul Bartus, Emmanuel Arzuaga
{"title":"GDedup:用于基因组大数据的分布式文件系统级重复数据删除","authors":"Paul Bartus, Emmanuel Arzuaga","doi":"10.1109/BigDataCongress.2018.00023","DOIUrl":null,"url":null,"abstract":"During the last years, the cost of sequencing has dropped, and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate. The storage needs for genomic sequence data are also following this trend. In order to solve these new storage needs, different compression algorithms have been used. Nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose the use of GDedup, a deduplication storage system for genomics data, in order to improve data storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume. We present a study on the relation between the amount of different types of mutations in genomic data such as point mutations, substitutions, inversions, and the effect of such in the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratio values are superior to the actual compression ratio values for both (file read-decompress or write-compress) I/O patterns, highlighting the potential for this technology to be effectively adapted to improve storage management of genomics data.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"GDedup: Distributed File System Level Deduplication for Genomic Big Data\",\"authors\":\"Paul Bartus, Emmanuel Arzuaga\",\"doi\":\"10.1109/BigDataCongress.2018.00023\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"During the last years, the cost of sequencing has dropped, and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate. The storage needs for genomic sequence data are also following this trend. In order to solve these new storage needs, different compression algorithms have been used. Nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose the use of GDedup, a deduplication storage system for genomics data, in order to improve data storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume. 
We present a study on the relation between the amount of different types of mutations in genomic data such as point mutations, substitutions, inversions, and the effect of such in the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratio values are superior to the actual compression ratio values for both (file read-decompress or write-compress) I/O patterns, highlighting the potential for this technology to be effectively adapted to improve storage management of genomics data.\",\"PeriodicalId\":177250,\"journal\":{\"name\":\"2018 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"38 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BigDataCongress.2018.00023\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BigDataCongress.2018.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

In recent years, the cost of sequencing has dropped and the amount of generated genomic sequence data has skyrocketed. As a consequence, genomic sequence data have become more expensive to store than to generate, and storage needs are growing accordingly. Different compression algorithms have been used to address these needs; nevertheless, typical compression ratios for genomic data range between 3 and 10. In this paper, we propose GDedup, a deduplication storage system for genomics data, to improve data storage capacity and efficiency in distributed file systems without compromising I/O performance. GDedup can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy in genomic sequence data and reduce the space needed to store these files in the file system, thus allowing for more capacity per volume. We present a study of the relation between the amount of different types of mutations in genomic data, such as point mutations, substitutions, and inversions, and their effect on the deduplication ratio for a data set of vertebrate genomes in FASTA format. The experimental results show that the deduplication ratios exceed the corresponding compression ratios for both I/O patterns (file read-decompress and write-compress), highlighting the potential for this technology to be effectively adapted to improve storage management of genomics data.
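To make the comparison between deduplication ratio and compression ratio concrete, the following is a minimal sketch of fixed-size, hash-based chunk deduplication over FASTA files. It illustrates the general technique only and is not the paper's GDedup implementation; the 4 KiB chunk size, the SHA-256 fingerprints, and the deduplication_ratio helper are assumptions made for this example. The ratio is total bytes seen divided by bytes of unique chunks actually stored, which is directly comparable to a compression ratio.

```python
# Minimal sketch of fixed-size, hash-based deduplication over FASTA files.
# This illustrates the general technique only; it is NOT the GDedup system.
# The 4 KiB chunk size and SHA-256 fingerprints are assumptions for the example.
import hashlib
from typing import Iterable

CHUNK_SIZE = 4096  # bytes per chunk (assumed; real systems tune this)

def deduplication_ratio(paths: Iterable[str]) -> float:
    """Return total bytes read / bytes of unique chunks across the given files."""
    seen = set()        # fingerprints of chunks already "stored"
    total = unique = 0  # byte counts before and after deduplication
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                total += len(chunk)
                digest = hashlib.sha256(chunk).digest()
                if digest not in seen:  # first occurrence: store the chunk
                    seen.add(digest)
                    unique += len(chunk)
    return total / unique if unique else 1.0

if __name__ == "__main__":
    # Hypothetical file names: two genomes that differ by a few substitutions
    # share most aligned 4 KiB chunks, so the ratio can exceed the 3-10
    # range typical of genomic compressors.
    print(f"dedup ratio: {deduplication_ratio(['genome_a.fasta', 'genome_b.fasta']):.2f}")
```

This sketch also suggests why the mutation type matters, which is the relation the paper studies: a point mutation or substitution changes only the chunk that contains it, while an insertion or deletion shifts every later chunk boundary and breaks the matches, so variable-size (content-defined) chunking is the usual remedy in production deduplication systems.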