R. Shyamasundar, Swatish Satheesan, Deepali Mittal, Aakash Chaudhary
{"title":"SecHadoop: A Privacy Preserving Hadoop","authors":"R. Shyamasundar, Swatish Satheesan, Deepali Mittal, Aakash Chaudhary","doi":"10.1145/3344341.3368819","DOIUrl":null,"url":null,"abstract":"With the generation of vast amounts of data, there has been a tremendous need for processing the same in an economical way. MapReduce paradigm provides an economical processing of huge datasets in an effective way. Hadoop is a framework for managing huge amounts of data, and facilitates parallel computations on data using commodity hardware, through an integration of MapReduce paradigm with the HDFS file system. Due to intrinsic data divisions during parallel processing, there is a possibility of data leaks. Thus, in the context of Hadoop, if processing has to keep the privacy invariant over the computation, it is necessary to guarantee privacy not only of the MapReduce process but also assure that the HDFS file system does leak any information. The focus of our work is on data security and privacy in such cloud environments. Our main thrust is to preserve data confidentiality and privacy as per specifications notwithstanding data divisions or scheduling for fault tolerance. We realise privacy invariance on Hadoop by monitoring the information flow from subjects to objects created in Hadoop using the readers writers flow model (RWFM). In this paper, we describe the design, implementation and performance of a security enhanced Hadoop, called SecHadoop. We illustrate our approach with various case studies corresponding to infection of map/reduce tasks, failure of nodes etc., and demonstrate how end-to-end security of programs is realised. It is further shown that the overall overhead is less than 5% on single/multi-node setup.","PeriodicalId":261870,"journal":{"name":"Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3344341.3368819","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
With the generation of vast amounts of data, there is a tremendous need to process it economically. The MapReduce paradigm provides an effective, economical way to process huge datasets. Hadoop is a framework for managing huge amounts of data that facilitates parallel computation on commodity hardware by integrating the MapReduce paradigm with the HDFS file system. Because data is intrinsically divided during parallel processing, there is a possibility of data leaks. Thus, in the context of Hadoop, if processing is to keep privacy invariant over the computation, it is necessary not only to guarantee the privacy of the MapReduce process but also to ensure that the HDFS file system does not leak any information. The focus of our work is data security and privacy in such cloud environments. Our main thrust is to preserve data confidentiality and privacy as per specifications, notwithstanding data divisions or scheduling for fault tolerance. We realise privacy invariance on Hadoop by monitoring the information flow from subjects to objects created in Hadoop using the Readers-Writers Flow Model (RWFM). In this paper, we describe the design, implementation, and performance of a security-enhanced Hadoop, called SecHadoop. We illustrate our approach with case studies covering the infection of map/reduce tasks, the failure of nodes, etc., and demonstrate how end-to-end security of programs is realised. We further show that the overall overhead is less than 5% on both single-node and multi-node setups.
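The abstract leans on the Readers-Writers Flow Model (RWFM), in which every subject and object carries a label of the form (owner, readers, writers), and information may flow from one labelled entity to another only if the flow never widens the reader set or shrinks the writer set. Below is a minimal, hypothetical Java sketch of such a label check (Java being Hadoop's implementation language); the class and method names are illustrative and are not taken from the SecHadoop code.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not SecHadoop's code: an RWFM label (owner, readers, writers).
public record RwfmLabel(String owner, Set<String> readers, Set<String> writers) {

    // Information may flow from this label to 'target' only if the target
    // is readable by no one new (target.readers is a subset of this.readers)
    // and admits at least the writers recorded so far
    // (this.writers is a subset of target.writers).
    public boolean canFlowTo(RwfmLabel target) {
        return readers.containsAll(target.readers())
            && target.writers().containsAll(writers);
    }

    // Least restrictive label dominating both inputs: readers intersect,
    // writers union. Used when a task combines two labelled inputs.
    public RwfmLabel join(RwfmLabel other) {
        Set<String> r = new HashSet<>(readers);
        r.retainAll(other.readers());
        Set<String> w = new HashSet<>(writers);
        w.addAll(other.writers());
        return new RwfmLabel(owner, r, w);
    }

    public static void main(String[] args) {
        // An HDFS block readable by alice and bob, written so far only by alice.
        RwfmLabel block = new RwfmLabel("alice", Set.of("alice", "bob"), Set.of("alice"));
        // A map task running on behalf of bob.
        RwfmLabel task = new RwfmLabel("bob", Set.of("bob"), Set.of("alice", "bob"));
        System.out.println(block.canFlowTo(task));  // true: readers shrink, writers grow
        System.out.println(task.canFlowTo(block));  // false: would widen the reader set
    }
}

In a monitor of this kind, the check would sit on every read/write between Hadoop subjects (map/reduce tasks) and objects (HDFS blocks, intermediate files), so that splitting or rescheduling work for fault tolerance cannot route data past the policy.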