A Big Data Placement Strategy in Geographically Distributed Datacenters

2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech). Pub Date: 2020-11-24. DOI: 10.1109/CloudTech49835.2020.9365881

L. Bouhouch, M. Zbakh, C. Tadonki


Citations: 1

Abstract

With the pervasiveness of "Big Data" and the expansion of geographically distributed datacenters in Cloud computing, processing large-scale data applications has become a crucial issue. Indeed, finding the most efficient way to store massive data across distributed locations is increasingly complex. Furthermore, the execution time of a task that requires several datasets might be dominated by the cost of data migrations and exchanges, which depends on the initial placement of the input datasets over the set of datacenters in the Cloud and also on the dynamic data management strategy. In this paper, we propose a data placement strategy that improves workflow execution time by reducing the cost of data movements between geographically distributed datacenters, considering characteristics such as storage capacity and read/write speeds. We formalize the overall problem and then propose a data placement algorithm structured in two phases. First, we compute the estimated transfer time to move all involved datasets from their respective locations to the one where the corresponding tasks are executed. Second, we apply a greedy algorithm to assign each dataset to the optimal datacenter with respect to the overall cost of data migrations. Both the heterogeneity of the datacenters and their characteristics (storage and bandwidth) are taken into account. Our experiments are conducted using the CloudSim simulator. The results show that the proposed strategy produces an efficient placement and reduces the overhead of data movement compared to both a random assignment and a selected placement algorithm from the literature.
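The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, whose formal model is not reproduced here; it is one plausible reading under stated assumptions. The names `Datacenter`, `Task`, `transfer_times`, and `greedy_placement` are hypothetical, transfer time is approximated as dataset size divided by link bandwidth, read/write speeds are ignored, and datasets are processed largest first, which is an arbitrary ordering choice.

```python
from dataclasses import dataclass


@dataclass
class Datacenter:
    name: str
    capacity: float  # free storage in GB (hypothetical unit)


@dataclass
class Task:
    datacenter: str  # name of the datacenter where the task executes
    inputs: list     # names of the datasets the task reads


def transfer_times(sizes, tasks, datacenters, bandwidth):
    """Phase 1: for every (dataset, candidate datacenter) pair, estimate the
    total time to ship the dataset to each task that needs it."""
    cost = {}
    for ds, size in sizes.items():
        for dc in datacenters:
            total = 0.0
            for task in tasks:
                # A transfer is needed only when the task runs elsewhere.
                if ds in task.inputs and task.datacenter != dc.name:
                    total += size / bandwidth[(dc.name, task.datacenter)]
            cost[(ds, dc.name)] = total
    return cost


def greedy_placement(sizes, datacenters, cost):
    """Phase 2: place datasets one by one (largest first, an assumption) on
    the cheapest datacenter that still has room."""
    free = {dc.name: dc.capacity for dc in datacenters}
    placement = {}
    for ds in sorted(sizes, key=sizes.get, reverse=True):
        candidates = [name for name, room in free.items() if room >= sizes[ds]]
        if not candidates:
            raise ValueError(f"no datacenter can store dataset {ds!r}")
        best = min(candidates, key=lambda name: cost[(ds, name)])
        placement[ds] = best
        free[best] -= sizes[ds]
    return placement


# Toy instance: two datacenters, two datasets, two tasks that both run in B.
datacenters = [Datacenter("A", 100.0), Datacenter("B", 50.0)]
bandwidth = {("A", "B"): 1.0, ("B", "A"): 1.0}  # GB/s between datacenters
sizes = {"d1": 60.0, "d2": 40.0}                # dataset sizes in GB
tasks = [Task("B", ["d1", "d2"]), Task("B", ["d2"])]

cost = transfer_times(sizes, tasks, datacenters, bandwidth)
placement = greedy_placement(sizes, datacenters, cost)
print(placement)
```

In this toy instance both datasets would be cheapest to keep in B, where the tasks run, but d1 does not fit in B's 50 GB, so the greedy step sends it to A and places d2 in B. The example shows how storage capacity can override the pure transfer-time preference.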

