Autonomous, failure-resilient orchestration of distributed discrete event simulations

Matthew Malensek, Z. Sui, Neil Harvey, S. Pallickara
Venue: ACM Cloud and Autonomic Computing Conference
Published: 2013-08-09
DOI: 10.1145/2494621.2494625 (https://doi.org/10.1145/2494621.2494625)
Citations: 6

Abstract

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
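
The abstract does not say how the agent predicts state changes or which policy it adjusts. As a purely illustrative sketch, the Python below assumes that checkpointing is the fault tolerance mechanism and that an exponentially weighted moving average stands in for the predictor. Both are assumptions, and the names `AdaptiveCheckpointer`, `budget`, `alpha`, and `run` are hypothetical rather than taken from the paper. The sketch shows one way a policy could checkpoint rarely while the simulation state is quiet and more often once large changes are predicted.

```python
import copy
import heapq


class AdaptiveCheckpointer:
    """Hypothetical policy: checkpoint sooner when state is predicted to change quickly."""

    def __init__(self, budget=10.0, alpha=0.5):
        self.budget = budget      # predicted change tolerated between checkpoints (assumed knob)
        self.alpha = alpha        # smoothing factor of the moving-average predictor (assumed)
        self.predicted = 0.0      # predicted amount of state change per event
        self.accumulated = 0.0    # predicted change accumulated since the last checkpoint
        self.snapshot = None      # most recent (clock, state copy) pair

    def observe(self, change, clock, state):
        """Update the prediction; take a checkpoint and return True when the budget is used up."""
        self.predicted = self.alpha * change + (1 - self.alpha) * self.predicted
        self.accumulated += self.predicted
        if self.accumulated >= self.budget:
            self.snapshot = (clock, copy.deepcopy(state))
            self.accumulated = 0.0
            return True
        return False


def run(events, checkpointer):
    """Process (time, delta) events in timestamp order.

    Returns the final state and the list of simulation times at which
    the checkpointer decided to take a snapshot.
    """
    queue = list(events)
    heapq.heapify(queue)          # event list ordered by timestamp
    state = {"population": 0}
    checkpoint_times = []
    while queue:
        clock, delta = heapq.heappop(queue)
        state["population"] += delta
        if checkpointer.observe(abs(delta), clock, state):
            checkpoint_times.append(clock)
    return state, checkpoint_times


if __name__ == "__main__":
    # Five quiet events followed by five large ones.
    quiet = [(t, 1) for t in range(1, 6)]
    busy = [(t, 10) for t in range(6, 11)]
    final_state, times = run(quiet + busy, AdaptiveCheckpointer())
    print(final_state, times)
```

With this input, no checkpoint is taken during the quiet phase, and checkpoints appear only once the large updates begin. A real system would also have to coordinate snapshots across stateful, communicating tasks, which this single-process sketch does not attempt.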