Global reduction for geo-distributed MapReduce across cloud federation
Authors: Thouraya Gouasmi, Ahmed Hadj Kacem
Journal: Future Generation Computer Systems-The International Journal of Escience, volume 162, Article 107492 (Impact Factor 6.2; JCR Q1, Computer Science, Theory & Methods)
Published: 2024-08-26
DOI: 10.1016/j.future.2024.107492
URL: https://www.sciencedirect.com/science/article/pii/S0167739X24004485
Citations: 0
Abstract
Geo-distributed big data processing is growing steadily: data increasingly originates in different countries and is held in datacenters (DCs) across the globe, and applications use multiple sites to increase reliability, security, and processing performance. The most popular frameworks, such as Hadoop and Spark, have been redesigned to process geographically distributed data where it resides. However, these methods still transfer large amounts of data over the Internet, which leads to prohibitively high processing time and cost for many applications, even though in several cases the output of the computation is smaller than its input. In this paper, we keep the data locality principle of processing data at its different locations but abandon the principle of transferring all intermediate results to a single global reducer. We propose Geo-MR, an intelligent geo-distributed MapReduce-based framework across a federated cloud, built on two heuristic algorithms: (i) GResearch, which chooses the best clusters as global reducers to reduce communication and optimize bandwidth use; and (ii) Geo-MR, which schedules only the relevant data to the selected global reducers that compute the final results. As a baseline, we propose an exact MapReduce scheduling model for benchmarking, against which we compare and discuss the results of the Geo-MR heuristic. The experimental results show that Geo-MR can improve resource utilization (bandwidth and cluster VMs) in the cloud federation and consequently reduce cost and job response time.
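The idea of choosing a cluster as global reducer so that less data crosses the wide-area network can be illustrated with a small sketch. This is not the paper's GResearch algorithm, whose details the abstract does not give. It is a hypothetical greedy rule that assumes each cluster's intermediate data size and the pairwise bandwidths are known, and that uses the sum of individual transfer times as a simple cost proxy. The names `transfer_cost`, `pick_global_reducer`, and the sample clusters and figures are all invented for illustration.

```python
from typing import Dict, Tuple

Sizes = Dict[str, float]                   # intermediate data per cluster (GB), assumed known
Bandwidth = Dict[Tuple[str, str], float]   # (source, destination) -> GB/s, assumed known


def transfer_cost(reducer: str, sizes: Sizes, bandwidth: Bandwidth) -> float:
    """Sum of the times needed to ship every other cluster's data to `reducer`.

    Data already located at the reducer stays local and costs nothing.
    """
    return sum(
        size / bandwidth[(src, reducer)]
        for src, size in sizes.items()
        if src != reducer
    )


def pick_global_reducer(sizes: Sizes, bandwidth: Bandwidth) -> str:
    """Return the cluster with the lowest total transfer cost (hypothetical rule)."""
    # Sorting makes the choice deterministic when two clusters tie.
    return min(sorted(sizes), key=lambda c: transfer_cost(c, sizes, bandwidth))


# Invented example: three clusters with uneven intermediate data.
sizes = {"eu": 40.0, "us": 10.0, "asia": 5.0}
bandwidth = {
    ("eu", "us"): 1.0, ("us", "eu"): 1.0,
    ("eu", "asia"): 0.5, ("asia", "eu"): 0.5,
    ("us", "asia"): 0.8, ("asia", "us"): 0.8,
}

best = pick_global_reducer(sizes, bandwidth)
print(best, transfer_cost(best, sizes, bandwidth))
```

In this invented setting the cluster holding most of the intermediate data is chosen, because keeping its 40 GB local avoids the largest transfer. A fuller model would also need to account for VM capacity, monetary cost, and several global reducers working in parallel, which the abstract mentions but this sketch leaves out.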
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.