Seunghye Han, Wonseok Choi, Rayan Muwafiq, Yunmook Nah
{"title":"基于Hadoop和Spark的内存大小对大数据处理的影响","authors":"Seunghye Han, Wonseok Choi, Rayan Muwafiq, Yunmook Nah","doi":"10.1145/3129676.3129688","DOIUrl":null,"url":null,"abstract":"Hadoop and Spark are well-known big data processing platforms. The main technologies of Hadoop are Hadoop Distributed File System and MapReduce processing. Hadoop stores intermediary data on Hadoop Distributed File System, which is a disk-based distributed file system, while Spark stores intermediary data in the memories of distributed computing nodes as Resilient Distributed Dataset. In this paper, we show how memory size affects distributed processing of large volume of data, by comparing the running time of K-means algorithm of HiBench benchmark on Hadoop and Spark clusters, with different size of memories allocated to data nodes. Our results show that Spark cluster is faster than Hadoop cluster as long as the memory size is big enough for the data size. But, with the increase of the data size, Hadoop cluster outperforms Spark cluster. When data size is bigger than memory cache, Spark has to replace disk data with memory cached data, and this situation causes performance degradation.","PeriodicalId":326100,"journal":{"name":"Proceedings of the International Conference on Research in Adaptive and Convergent Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Impact of Memory Size on Bigdata Processing based on Hadoop and Spark\",\"authors\":\"Seunghye Han, Wonseok Choi, Rayan Muwafiq, Yunmook Nah\",\"doi\":\"10.1145/3129676.3129688\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hadoop and Spark are well-known big data processing platforms. The main technologies of Hadoop are Hadoop Distributed File System and MapReduce processing. Hadoop stores intermediary data on Hadoop Distributed File System, which is a disk-based distributed file system, while Spark stores intermediary data in the memories of distributed computing nodes as Resilient Distributed Dataset. In this paper, we show how memory size affects distributed processing of large volume of data, by comparing the running time of K-means algorithm of HiBench benchmark on Hadoop and Spark clusters, with different size of memories allocated to data nodes. Our results show that Spark cluster is faster than Hadoop cluster as long as the memory size is big enough for the data size. But, with the increase of the data size, Hadoop cluster outperforms Spark cluster. 
When data size is bigger than memory cache, Spark has to replace disk data with memory cached data, and this situation causes performance degradation.\",\"PeriodicalId\":326100,\"journal\":{\"name\":\"Proceedings of the International Conference on Research in Adaptive and Convergent Systems\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the International Conference on Research in Adaptive and Convergent Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3129676.3129688\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on Research in Adaptive and Convergent Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3129676.3129688","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Impact of Memory Size on Bigdata Processing based on Hadoop and Spark
Hadoop and Spark are well-known big data processing platforms. The core technologies of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce processing. Hadoop stores intermediate data on HDFS, a disk-based distributed file system, while Spark keeps intermediate data in the memory of the distributed computing nodes as Resilient Distributed Datasets (RDDs). In this paper, we show how memory size affects the distributed processing of large volumes of data by comparing the running time of the K-means workload from the HiBench benchmark on Hadoop and Spark clusters, with different amounts of memory allocated to the data nodes. Our results show that the Spark cluster is faster than the Hadoop cluster as long as the memory is large enough for the data. However, as the data size grows, the Hadoop cluster outperforms the Spark cluster: when the data no longer fits in the memory cache, Spark must evict cached data and re-read it from disk, which degrades performance.
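The caching behavior described in the abstract can be illustrated with a minimal sketch. This is not the authors' experimental code: it assumes a PySpark environment, and the executor memory size, HDFS input path, and column layout are all hypothetical. It only shows where the memory allocation studied in the paper would be configured and where intermediate data gets cached in executor memory.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    # Executor memory is the variable of interest; "4g" is an arbitrary example value
    # that would be varied across experiment runs.
    spark = (SparkSession.builder
             .appName("kmeans-memory-sketch")
             .config("spark.executor.memory", "4g")
             .getOrCreate())

    # Hypothetical HiBench-style input stored on HDFS; path and schema are assumed.
    df = spark.read.parquet("hdfs:///data/kmeans/input")

    # Assemble the numeric columns into a feature vector and cache it in memory.
    # If the cached data exceeds the available memory, Spark evicts partitions and
    # recomputes or re-reads them from disk, which is the slowdown the paper measures.
    features = (VectorAssembler(inputCols=df.columns, outputCol="features")
                .transform(df)
                .cache())

    model = KMeans(k=10, maxIter=20, featuresCol="features").fit(features)
    print(model.clusterCenters())

Under this setup, shrinking spark.executor.memory relative to the input size is what pushes the cached features out of memory and reproduces the degradation the paper reports.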