Markov Chain-based Modeling and Analysis of Checkpointing with Rollback Recovery for Efficient DSE in Soft Real-time Systems

2020 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Pub Date: 2020-10-19, DOI: 10.1109/DFT50435.2020.9250892

Siva Satyendra Sahoo, B. Veeravalli, Akash Kumar


Citations: 3

Abstract

Continued transistor scaling and increasing power density have led to an increase in both transient and aging-related fault rates in silicon-based electronic systems. Traditional spatial redundancy-based fault-mitigation methods such as Triple Modular Redundancy (TMR) can lead to even higher power dissipation. In addition to accelerating the system’s rate of aging, such high power dissipation may be infeasible for resource-constrained embedded systems. Consequently, temporal redundancy-based methods are increasingly used to satisfy embedded applications’ reliability requirements. However, such methods result in stochastic execution time and hence introduce additional complexity for soft real-time system design. A simulation-based approach for finding the quality of service (QoS)-aware optimal design points can lead to long design space exploration (DSE) times. To this end, we propose a Markov Chain-based model for checkpointing with rollback recovery, a widely used temporal redundancy-based fault-mitigation method. Specifically, a task execution using checkpointing with intermediate validations is modeled as an absorbing Markov Chain, and methods are presented for estimating the mean, variance and probability distribution of the task’s resulting execution time. Further, we propose a multilevel design space pruning approach for determining the QoS-aware configuration of checkpointing. The presented modeling and estimation methods lead to considerable improvements in DSE time compared to a simulation-only approach.
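As a rough illustration of the absorbing Markov Chain idea described in the abstract, the sketch below estimates the mean and variance of a task's execution time using the standard fundamental-matrix results for absorbing chains. It is not the paper's model: the chain structure (one transient state per segment, a failed validation re-executes only the current segment), the uniform per-step cost, and all names (`checkpoint_chain`, `exec_time_mean_var`, `p_fail`, `seg_time`, `overhead`) are assumptions made for this example.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def checkpoint_chain(n_segments, p_fail):
    """Transient block Q of a hypothetical chain: state i = segment i running."""
    q = [[0.0] * n_segments for _ in range(n_segments)]
    for i in range(n_segments):
        # Validation fails: roll back to the last checkpoint, retry segment i.
        q[i][i] = p_fail
        # Validation passes: move on to the next segment. From the last
        # segment, success leads to the absorbing "done" state (not in Q).
        if i + 1 < n_segments:
            q[i][i + 1] = 1.0 - p_fail
    return q


def steps_mean_var(q):
    """Mean and variance of the number of steps to absorption from state 0."""
    n = len(q)
    i_minus_q = [[(1.0 if r == c else 0.0) - q[r][c] for c in range(n)]
                 for r in range(n)]
    t = solve(i_minus_q, [1.0] * n)   # t = N 1, with N = (I - Q)^-1
    nt = solve(i_minus_q, t)          # N t
    # Variance of steps: (2N - I) t - t*t (elementwise square).
    var = [2.0 * nt[k] - t[k] - t[k] ** 2 for k in range(n)]
    return t[0], var[0]


def exec_time_mean_var(n_segments, p_fail, seg_time, overhead):
    """Assumes every step (first run or retry) costs seg_time + overhead."""
    mean_steps, var_steps = steps_mean_var(checkpoint_chain(n_segments, p_fail))
    step_cost = seg_time + overhead
    return mean_steps * step_cost, var_steps * step_cost ** 2


if __name__ == "__main__":
    mean, var = exec_time_mean_var(n_segments=4, p_fail=0.1,
                                   seg_time=10.0, overhead=1.0)
    print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

Under these assumptions the result can be checked by hand, since the step count is a sum of independent geometric variables: the mean number of steps is n / (1 - p) and the variance is n p / (1 - p)^2. Closed-form estimates of this kind are what make it possible to discard candidate configurations without simulating each one.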

