Autonomous, failure-resilient orchestration of distributed discrete event simulations

ACM Cloud and Autonomic Computing Conference. Pub Date: 2013-08-09. DOI: 10.1145/2494621.2494625

Matthew Malensek, Z. Sui, Neil Harvey, S. Pallickara


Citation count: 6

Abstract

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes.

In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
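The abstract does not say how the agent predicts state changes or which policies it switches between, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hypothetical agent that smooths observed per-step state changes with an exponential moving average and takes a checkpoint once the unsaved change plus the predicted next change would exceed a budget. The class name, the `alpha` and `threshold` parameters, and the notion of a numeric "state delta" are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptiveCheckpointAgent:
    """Hypothetical agent: checkpoints more often when state is predicted to change quickly."""

    alpha: float = 0.5           # smoothing factor for the moving average (assumed value)
    threshold: float = 10.0      # unsaved-change budget that triggers a checkpoint (assumed value)
    predicted_rate: float = 0.0  # smoothed estimate of state change per step
    unsaved_change: float = 0.0  # change accumulated since the last checkpoint
    checkpoints: list = field(default_factory=list)

    def observe(self, step: int, state_delta: float) -> bool:
        """Record one simulation step; return True if a checkpoint was taken."""
        # Update the exponentially weighted prediction of the next step's change.
        self.predicted_rate = (
            self.alpha * state_delta + (1 - self.alpha) * self.predicted_rate
        )
        self.unsaved_change += state_delta
        # Checkpoint if the change expected after the next step would exceed the budget.
        if self.unsaved_change + self.predicted_rate >= self.threshold:
            self.checkpoints.append(step)
            self.unsaved_change = 0.0
            return True
        return False


def run(deltas):
    """Feed a sequence of per-step state deltas to a fresh agent; return checkpoint steps."""
    agent = AdaptiveCheckpointAgent()
    for step, delta in enumerate(deltas):
        agent.observe(step, delta)
    return agent.checkpoints
```

Under these assumed defaults, a quiet phase (small deltas) is checkpointed rarely and a busy phase (large deltas) is checkpointed often, which is one simple way to keep fault tolerance overhead low while bounding how much work a failure can lose.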



