{"title":"Bucket based data deduplication technique for big data storage system","authors":"N. Kumar, R. Rawat, S. C. Jain","doi":"10.1109/ICRITO.2016.7784963","DOIUrl":null,"url":null,"abstract":"In this paper proposed bucket based data deduplication technique is presented. In proposed technique bigdata stream is given to the fixed size chunking algorithm to create fixed size chunks. When the chunks are obtained then these chunks are given to the MD5 algorithm module to generate hash values for the chunks. After that MapReduce model is applied to find whether hash values are duplicate or not. To detect the duplicate hash values MapReduce model compared these hash values with already stored hash values in bucket storage. If these hash values are already present in the bucket storage then these can be identified as duplicate. If the hash values are duplicated then do not store the data into the Hadoop Distributed File System (HDFS) else then store the data into the HDFS. The proposed technique is analyzed using real data set using Hadoop tool.","PeriodicalId":377611,"journal":{"name":"2016 5th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)","volume":"336 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 5th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICRITO.2016.7784963","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13
Abstract
In this paper, a bucket-based data deduplication technique is presented. In the proposed technique, a big data stream is passed to a fixed-size chunking algorithm to create fixed-size chunks. The chunks are then given to an MD5 module, which generates a hash value for each chunk. A MapReduce model is then applied to determine whether each hash value is a duplicate: it compares the newly generated hash values with the hash values already stored in bucket storage. If a hash value is already present in the bucket storage, the corresponding chunk is identified as a duplicate. Duplicate data is not written to the Hadoop Distributed File System (HDFS); otherwise the data is stored in HDFS. The proposed technique is evaluated on a real data set using the Hadoop tool.
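To make the chunk-hash-lookup flow concrete, the sketch below mimics the pipeline in plain Python rather than MapReduce. It is a minimal illustration, not the authors' Hadoop implementation: the 4 KB chunk size, the in-memory set standing in for bucket storage, and the list standing in for HDFS are all assumptions made for the example.

```python
import hashlib
import io

CHUNK_SIZE = 4096  # assumed fixed chunk size; the paper does not state one


def fixed_size_chunks(stream, chunk_size=CHUNK_SIZE):
    """Split an incoming byte stream into fixed-size chunks."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def deduplicate(stream, bucket, store):
    """Keep only chunks whose MD5 digest is not already in the bucket.

    `bucket` stands in for the paper's bucket storage of known hashes,
    `store` stands in for HDFS; both are plain Python containers here.
    """
    for chunk in fixed_size_chunks(stream):
        digest = hashlib.md5(chunk).hexdigest()
        if digest in bucket:        # duplicate hash: skip the write
            continue
        bucket.add(digest)          # remember the new hash
        store.append(chunk)         # unique chunk goes to storage


if __name__ == "__main__":
    # The last 4 KB repeats the first block, so it is detected as a duplicate.
    data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    bucket, store = set(), []
    deduplicate(io.BytesIO(data), bucket, store)
    print(len(bucket), "unique chunks stored,", len(store), "chunks written")
```

In the paper the duplicate check is distributed via MapReduce over bucketed hash storage; the single set lookup above only conveys the per-chunk decision of storing a chunk when its hash is new and skipping it otherwise.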