Optimizing data access latencies in cloud systems by intelligent virtual machine placement

2013 Proceedings IEEE INFOCOM. Pub Date: 2013-04-14. DOI: 10.1109/INFCOM.2013.6566850

M. Alicherry, T. V. Lakshman


Citations: 120

Abstract

Many cloud applications are data-intensive and require processing large data sets; the MapReduce/Hadoop architecture has become the de facto processing framework for them. Large data sets are stored in data nodes in the cloud, which are typically SAN or NAS devices. Cloud applications process these data sets using a large number of application virtual machines (VMs), and the total completion time is an important performance metric. Many factors affect the total completion time of a processing task, including the load on individual servers, the task scheduling mechanism, and communication and data access bottlenecks. One dominant factor for data-intensive applications is the access latency from processing nodes to data nodes. Ideally, all data access would be kept local to minimize access latency, but this is often not possible because of the size of the data sets, capacity constraints in processing nodes that prevent VMs from being placed in their ideal locations, and similar limits. When not all data access can be kept local, the placement of VMs should be optimized so that the impact of data access latencies on completion times is minimized. We address this problem of optimized VM placement: given the locations of the data sets, determine the locations for placing the VMs so as to minimize data access latencies while satisfying system constraints. We present optimal algorithms for determining VM locations that satisfy various constraints, with objectives that capture natural tradeoffs between minimizing latencies and incurring bandwidth costs. We also consider the problem of incorporating inter-VM latency constraints. In this case, the associated location problem is NP-hard with no effective approximation within a factor of 2 - ϵ for any ϵ > 0. We discuss an effective heuristic for this case and evaluate by simulation the impact of the various tradeoffs in the optimization objectives.
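To make the basic placement problem concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a simplified model that the abstract does not spell out: each VM reads from one known data node, each compute node can host a limited number of VMs, and the objective is the sum of VM-to-data latencies. The function name `place_vms` and its parameters are hypothetical, and the exhaustive search is only practical for tiny instances.

```python
from itertools import product


def place_vms(latency, capacity, vm_data):
    """Exhaustively search for a minimum-total-latency VM placement.

    Assumed (hypothetical) model, not taken from the paper:
      latency[c][d] -- access latency from compute node c to data node d
      capacity[c]   -- maximum number of VMs compute node c can host
      vm_data[v]    -- index of the data node that VM v reads from

    Returns (total_latency, placement), where placement[v] is the compute
    node chosen for VM v, or (None, None) if no placement fits capacities.
    """
    n_nodes = len(capacity)
    best_cost, best_placement = None, None
    # Enumerate every assignment of VMs to compute nodes.
    for placement in product(range(n_nodes), repeat=len(vm_data)):
        # Skip assignments that overload any compute node.
        if any(placement.count(c) > capacity[c] for c in range(n_nodes)):
            continue
        cost = sum(latency[c][vm_data[v]] for v, c in enumerate(placement))
        if best_cost is None or cost < best_cost:
            best_cost, best_placement = cost, placement
    return best_cost, best_placement


# Three compute nodes, two data nodes, three VMs (illustrative numbers).
latency = [[1, 5], [2, 2], [6, 1]]
capacity = [1, 1, 1]
vm_data = [0, 0, 1]  # VMs 0 and 1 read data node 0; VM 2 reads data node 1
cost, placement = place_vms(latency, capacity, vm_data)
print(cost, placement)
```

In this toy instance the capacity limits force one VM per compute node, so the VM that reads data node 1 ends up on the node closest to it while the other two share the remaining nodes. For realistic sizes, a polynomial-time assignment or min-cost-flow solver would replace the brute-force loop; the inter-VM latency variant described in the abstract is NP-hard and is not covered by this sketch.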
