Workload Partitioning and Task Migration to Reduce Response Times in Heterogeneous Computing Environments

2018 27th International Conference on Computer Communication and Networks (ICCCN) | Pub Date: 2018-07-01 | DOI: 10.1109/ICCCN.2018.8487326

Dominik Schäfer, Janick Edinger, Martin Breitbach, C. Becker


Citations: 7

Abstract

Today's computing landscape consists of a huge number of heterogeneous devices, including powerful, stable desktop computers as well as lightweight, unreliable mobile edge devices. This heterogeneity in terms of computation power and reliability increases the complexity of fault tolerance in distributed computing systems. When tasks are offloaded, slow resource providers easily become the bottleneck of a parallel computation. Further, unstable edge devices can leave the system spontaneously, discontinue remote task executions, and therefore lose the computation progress. These two effects increase the response time of remote task executions. In this paper, we introduce two mechanisms to avoid delayed or lost task executions caused by edge devices. This paper makes five contributions. First, we define a failure model and identify the parameters that determine the magnitude of delays caused by faults and performance bottlenecks. Second, we present reactive and proactive task migration to handle system leaves. Third, we show how computational bottlenecks can be avoided by two-dimensional context-aware task partitioning. Fourth, we integrate these two solutions into an existing heterogeneous distributed computing system. Fifth, we run an evaluation on a real-world testbed to show the benefits of the solutions in practice. The evaluation shows that we can improve systems with device fluctuation and heterogeneity by up to 39% and 53%, respectively.
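The bottleneck effect named in the abstract can be illustrated with a small sketch. This is not the paper's two-dimensional context-aware partitioning, which the abstract does not specify. It only assumes that a parallel job finishes when its slowest partition finishes, and that provider speeds are known in advance. The provider speeds, the workload size, and all function names are hypothetical.

```python
from typing import List


def response_time(shares: List[float], speeds: List[float]) -> float:
    """A parallel job finishes when its slowest partition finishes."""
    return max(share / speed for share, speed in zip(shares, speeds))


def equal_split(total_work: float, speeds: List[float]) -> List[float]:
    """Give every provider the same amount of work, ignoring its speed."""
    return [total_work / len(speeds)] * len(speeds)


def proportional_split(total_work: float, speeds: List[float]) -> List[float]:
    """Give each provider work in proportion to its (assumed known) speed."""
    total_speed = sum(speeds)
    return [total_work * s / total_speed for s in speeds]


# Hypothetical providers: two desktops and one slow mobile device,
# with speeds given in work units per second (illustrative values only).
speeds = [4.0, 4.0, 1.0]
total_work = 90.0

naive = response_time(equal_split(total_work, speeds), speeds)
aware = response_time(proportional_split(total_work, speeds), speeds)
print(f"equal split: {naive:.1f} s, proportional split: {aware:.1f} s")
```

With an equal split, the slow device holds back the whole computation. A speed-proportional split lets all providers finish at the same time under these assumptions. The sketch ignores device departures, which the paper handles separately through task migration.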

