Dacoop: Accelerating Data-Iterative Applications on Map/Reduce Cluster
Yi Liang, Guangrui Li, Lei Wang, Yanpeng Hu
2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies
Published: 2011-10-20
DOI: 10.1109/PDCAT.2011.32 (https://doi.org/10.1109/PDCAT.2011.32)
Citations: 3
Abstract
Map/reduce is a popular parallel processing framework for massive-scale data-intensive computing. A data-iterative application is composed of a series of map/reduce jobs and needs to repeatedly process some data files across these jobs. Existing implementations of the map/reduce framework focus on performing data processing in a single pass with one map/reduce job and do not directly support data-iterative applications, particularly in terms of explicitly specifying the data that is repeatedly processed across jobs. In this paper, we propose Dacoop, an extended version of the Hadoop map/reduce framework. Dacoop extends the map/reduce programming interface to specify the repeatedly processed data, introduces a shared-memory-based data cache mechanism that caches the data from its first access onward, and adopts caching-aware task scheduling so that the cached data can be shared among the map/reduce jobs of a data-iterative application. We evaluate Dacoop on two typical data-iterative applications, k-means clustering and domain rule reasoning in the Semantic Web, with real and synthetic datasets. Experimental results show that data-iterative applications achieve better performance on Dacoop than on Hadoop: the turnaround time of a data-iterative application is reduced by up to 15.1%.
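
The idea of caching-aware task scheduling can be illustrated with a small toy model. The sketch below is not Dacoop's actual interface or implementation, which the abstract does not describe. The class `CacheAwareScheduler`, the function `run_iterations`, the `cacheable` flag, the node names, and the round-robin fallback are all hypothetical, chosen only to show how marking data as repeatedly processed, caching it on first access, and steering later tasks to the caching node might fit together.

```python
from collections import defaultdict


class CacheAwareScheduler:
    """Toy model (hypothetical): prefer the node that already caches a split."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # node name -> set of split ids held in that node's simulated cache
        self.cache = defaultdict(set)
        self.hits = 0
        self.misses = 0
        self._next = 0  # round-robin cursor used when no cached copy exists

    def assign(self, split_id, cacheable):
        """Return the node chosen to run the map task for split_id."""
        if cacheable:
            for node in self.nodes:
                if split_id in self.cache[node]:
                    self.hits += 1
                    return node
        # No cached copy: fall back to round-robin placement (an assumption).
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        self.misses += 1
        if cacheable:
            # Cache the split on its first access so later jobs can reuse it.
            self.cache[node].add(split_id)
        return node


def run_iterations(scheduler, splits, iterations):
    """Run `iterations` jobs over the same splits; return per-job placements."""
    placements = []
    for _ in range(iterations):
        placements.append([scheduler.assign(s, cacheable=True) for s in splits])
    return placements


if __name__ == "__main__":
    sched = CacheAwareScheduler(["node-a", "node-b"])
    plan = run_iterations(sched, ["split-0", "split-1", "split-2"], iterations=3)
    print(plan)
    print("hits:", sched.hits, "misses:", sched.misses)
```

In this model only the first job pays the cost of loading each split. Every later job is placed on the node that already holds the data, which is the kind of cross-job reuse the abstract attributes to the cache and the scheduler working together.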