Fault Tolerant Horizontal Computation Offloading

Alexander Droob, Daniel Morratz, Frederik Langkilde Jakobsen, Jacob Carstensen, Magnus Mathiesen, Rune Bohnstedt, M. Albano, Sergio Moreschini, D. Taibi
Published in: 2023 IEEE International Conference on Edge Computing and Communications (EDGE), 2023-05-24
DOI: 10.1109/EDGE60047.2023.00036 (https://doi.org/10.1109/EDGE60047.2023.00036)
Citations: 2

Abstract

The broad development and usage of edge devices have highlighted the importance of creating resilient and computationally advanced edge-to-cloud continuum environments. When working with edge devices, these desiderata are usually achieved through replication and offloading. This paper reports on the design and implementation of a fault-tolerant service that enables the offloading of jobs from devices with limited computational power. We propose a solution that allows users to upload jobs through a web service; the jobs are then executed on edge nodes within the system. The solution is designed to be fault tolerant and scalable, with no single point of failure and with the ability to accommodate growth if the service is expanded. The use of Docker checkpointing on the worker machines ensures that jobs can be resumed in the event of a fault. We provide a mathematical approach to optimize the number of checkpoints that are created during a computation, given that we can forecast the time needed to execute a job. We present experiments that indicate in which scenarios checkpointing benefits job execution. Our experiments show the benefits of using checkpointing and restore when the jobs' completion time rises relative to the forecast fault rate.
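
The abstract does not state the optimization formula itself. As a rough illustration only, the sketch below uses Young's classical first-order approximation for the checkpoint interval, sqrt(2 * C / lambda), as a stand-in; it is not the authors' method. The function names and the parameters `job_time`, `checkpoint_cost`, and `fault_rate` are assumptions introduced for this example, with all times in seconds and faults assumed to arrive at a constant rate.

```python
import math


def young_interval(checkpoint_cost: float, fault_rate: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C / lambda).

    checkpoint_cost: assumed time to write one checkpoint (seconds).
    fault_rate: assumed forecast faults per second (1 / mean time between faults).
    """
    if checkpoint_cost <= 0 or fault_rate <= 0:
        raise ValueError("checkpoint_cost and fault_rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / fault_rate)


def checkpoint_count(job_time: float, checkpoint_cost: float, fault_rate: float) -> int:
    """Number of checkpoints to take during a job of forecast length job_time.

    The job is cut into segments of roughly the Young interval. Only the
    boundaries between segments are counted, since a checkpoint taken when
    the job has already finished would protect nothing.
    """
    if job_time <= 0:
        raise ValueError("job_time must be positive")
    interval = young_interval(checkpoint_cost, fault_rate)
    segments = math.ceil(job_time / interval)
    return max(0, segments - 1)


if __name__ == "__main__":
    # Hypothetical numbers: 10 s per checkpoint, one fault every 30 minutes on average.
    rate = 1 / 1800
    print(checkpoint_count(3600, 10, rate))  # long job: several checkpoints
    print(checkpoint_count(100, 10, rate))   # short job: none
```

Under these assumed numbers, a one-hour job gets many checkpoints while a 100-second job gets none, which matches the qualitative claim that checkpointing pays off as job completion time grows relative to the forecast fault rate.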