A Framework for Data-Intensive Computing with Cloud Bursting

2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1145/2148600.2148604

Tekin Bicer, David Chiu, G. Agrawal


Citations: 49

Abstract

For many organizations, an attractive use of cloud resources is what is referred to as cloud bursting, or the hybrid cloud. In this scenario, an organization acquires and manages in-house resources to meet its base need, but uses additional resources from a cloud provider to maintain an acceptable response time during workload peaks. Cloud bursting has so far been discussed in the context of using additional computing resources from a cloud provider. However, as next-generation applications are expected to see orders-of-magnitude increases in data set sizes, cloud resources can also be used to store additional data after local resources are exhausted. In this paper, we consider the challenge of data analysis in a scenario where data is stored across a local cluster and cloud resources. We describe a software framework that enables data-intensive computing with cloud bursting, i.e., using a combination of compute resources from a local cluster and a cloud environment to perform Map-Reduce type processing on a data set that is geographically distributed. Our evaluation with three different applications shows that data-intensive computing with cloud bursting is feasible and scalable. In particular, compared to a situation where the data set is stored at one location and processed using resources at that location, the average slowdown of our system (using distributed compute resources of the same aggregate number) is only 15.55%. Thus, the overheads due to global reduction, remote data retrieval, and potential load imbalance are quite manageable. Our system scales with an average speedup of 81% when the number of compute resources is doubled.
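The abstract mentions Map-Reduce type processing over data split between a local cluster and the cloud, followed by a global reduction. The sketch below illustrates that general pattern only, using word counting as a stand-in workload. It is not the paper's framework: the function names `local_reduce` and `global_reduce` and the sample data are hypothetical, and both "sites" are simulated in one process.

```python
from collections import Counter
from functools import reduce


def local_reduce(chunks):
    """Map each chunk to word counts and fold them into one per-site result."""
    partial = Counter()
    for chunk in chunks:
        partial.update(chunk.split())  # map step plus local combine
    return partial


def global_reduce(partials):
    """Merge the per-site results into the final answer."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical data: part of the data set lives on each site.
local_cluster_chunks = ["a b a", "c a"]
cloud_chunks = ["b b", "c"]

# Each site reduces its own data first, so only the compact
# partial results need to be exchanged for the global reduction.
site_results = [local_reduce(local_cluster_chunks), local_reduce(cloud_chunks)]
total = global_reduce(site_results)
print(total)
```

In this sketch the merge works because counting is associative and commutative; a workload without that property would need a different combining step.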
