Fault Tolerant Horizontal Computation Offloading

2023 IEEE International Conference on Edge Computing and Communications (EDGE); Pub Date: 2023-05-24; DOI: 10.1109/EDGE60047.2023.00036

Alexander Droob, Daniel Morratz, Frederik Langkilde Jakobsen, Jacob Carstensen, Magnus Mathiesen, Rune Bohnstedt, M. Albano, Sergio Moreschini, D. Taibi

Citations: 2

Abstract

The broad development and usage of edge devices have highlighted the importance of creating resilient and computationally advanced edge-to-cloud continuum environments. When working with edge devices, these desiderata are usually achieved through replication and offloading. This paper reports on the design and implementation of a fault-tolerant service that enables the offloading of jobs from devices with limited computational power. We propose a solution that allows users to upload jobs through a web service; the jobs are then executed on edge nodes within the system. The solution is designed to be fault tolerant and scalable, with no single point of failure and with the ability to accommodate growth if the service is expanded. The use of Docker checkpointing on the worker machines ensures that jobs can be resumed in the event of a fault. We provide a mathematical approach to optimize the number of checkpoints created during a computation, given that we can forecast the time needed to execute a job. We present experiments that indicate in which scenarios checkpointing benefits job execution. Our experiments show the benefits of using checkpoint and restore as job completion time rises relative to the forecast fault rate.
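
The abstract does not state the formula the authors use to choose the number of checkpoints. As an illustration only, the trade-off can be sketched with a textbook model that is an assumption here, not the paper's method: faults arrive as a Poisson process with a known rate, checkpoints are evenly spaced, each checkpoint and each restore has a fixed cost, and no fault occurs during a restore. All function and parameter names below are hypothetical.

```python
import math


def expected_runtime(job_time, fault_rate, checkpoint_cost, restore_cost, checkpoints):
    """Expected wall-clock time of a job with evenly spaced checkpoints.

    Assumed model (not taken from the paper): exponential inter-fault times
    with rate `fault_rate`; after a fault the job pays `restore_cost` and
    repeats the current segment from its start.
    """
    segments = checkpoints + 1
    work = job_time / segments
    overhead = 1.0 / fault_rate + restore_cost
    # Expected time to finish an uninterrupted stretch of length w is
    # (1/rate + restore_cost) * (exp(rate * w) - 1).
    # Every segment except the last is followed by a checkpoint.
    with_checkpoint = overhead * math.expm1(fault_rate * (work + checkpoint_cost))
    final_segment = overhead * math.expm1(fault_rate * work)
    return checkpoints * with_checkpoint + final_segment


def best_checkpoint_count(job_time, fault_rate, checkpoint_cost, restore_cost,
                          max_checkpoints=1000):
    """Brute-force search for the checkpoint count with the lowest expected runtime."""
    return min(
        range(max_checkpoints + 1),
        key=lambda k: expected_runtime(
            job_time, fault_rate, checkpoint_cost, restore_cost, k
        ),
    )


if __name__ == "__main__":
    # Hypothetical numbers: a short job and a long job under different fault rates.
    print(best_checkpoint_count(10.0, 1e-4, 5.0, 5.0))
    print(best_checkpoint_count(10_000.0, 1e-3, 5.0, 5.0))
```

Under these assumptions, a job that is short relative to the expected time between faults is best run without checkpoints, while a long job benefits from many, which matches the qualitative conclusion of the abstract.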
