Enabling Big Data Analytics in the Hybrid Cloud Using Iterative MapReduce

2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC) Pub Date : 2015-12-07 DOI:10.1109/UCC.2015.47

Francisco J. Clemente-Castelló, Bogdan Nicolae, K. Katrinis, M. M. Rafique, R. Mayo, J. C. Fernández, Daniela Loreti


Citations: 25

Abstract

The cloud computing model has seen tremendous commercial success through its materialization via two prominent models to date, namely public and private cloud. Recently, a third model that combines the former two service models as on-/off-premise resources has been gaining significant market traction: hybrid cloud. While state-of-the-art techniques exist that address workload performance prediction and efficient workload execution over hybrid cloud setups, work on data-intensive workloads - including Big Data Analytics - in similar environments is still nascent. This paper addresses this gap by taking on the challenge of bursting over hybrid clouds in order to accelerate iterative MapReduce applications. We first specify the challenges associated with data locality and data movement in such setups. Subsequently, we propose a novel technique to address the locality issue without requiring changes to the MapReduce framework or the underlying storage layer. In addition, we contribute a performance prediction methodology that combines modeling with micro-benchmarks to estimate completion time for iterative MapReduce applications, which enables users to estimate cost-to-solution before committing extra resources from public clouds. We show through experimentation in a dual-OpenStack hybrid cloud setup that our solutions bring substantial improvement at predictable cost for two real-life iterative MapReduce applications: large-scale machine learning and text analysis.
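To make the idea of estimating completion time and cost-to-solution before bursting more concrete, here is a minimal Python sketch. It is not the authors' model: the wave-based formula, the one-off data transfer term, the hourly billing, and every field name and number are illustrative assumptions. It only shows how micro-benchmarked inputs (time per map task, inter-cloud bandwidth) could feed a simple analytical estimate.

```python
import math
from dataclasses import dataclass


@dataclass
class Setup:
    """Hypothetical inputs; none of these values come from the paper."""
    on_prem_vms: int            # VMs available in the private cloud
    burst_vms: int              # extra VMs rented off-premise
    task_seconds: float         # assumed micro-benchmark: time of one map task
    tasks_per_iteration: int    # map tasks launched in each iteration
    iterations: int             # number of MapReduce iterations
    input_gb: float             # data shipped once to the off-premise VMs
    link_gb_per_s: float        # assumed micro-benchmark: inter-cloud bandwidth
    price_per_vm_hour: float    # public cloud price, billed per started hour


def completion_seconds(s: Setup) -> float:
    """Estimate runtime as a one-off transfer plus per-iteration task waves."""
    vms = s.on_prem_vms + s.burst_vms
    # Assumes one task slot per VM, so tasks run in ceil(tasks / vms) waves.
    waves = math.ceil(s.tasks_per_iteration / vms)
    compute = s.iterations * waves * s.task_seconds
    # Data moves across the inter-cloud link only when bursting is used.
    transfer = s.input_gb / s.link_gb_per_s if s.burst_vms > 0 else 0.0
    return transfer + compute


def burst_cost(s: Setup) -> float:
    """Cost of the off-premise VMs for the estimated runtime."""
    hours = math.ceil(completion_seconds(s) / 3600)
    return s.burst_vms * hours * s.price_per_vm_hour


if __name__ == "__main__":
    base = Setup(4, 0, 30.0, 64, 10, 20.0, 0.1, 0.2)
    burst = Setup(4, 4, 30.0, 64, 10, 20.0, 0.1, 0.2)
    print("on-premise only:", completion_seconds(base), "s")
    print("with bursting:  ", completion_seconds(burst), "s, cost", burst_cost(burst))
```

Comparing the two estimates lets a user weigh the predicted speed-up against the rental cost before committing public cloud resources. A realistic model would also need to account for shuffle traffic, reduce phases, and the data locality effects that the paper highlights.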
