Markov Chain-based Modeling and Analysis of Checkpointing with Rollback Recovery for Efficient DSE in Soft Real-time Systems
Authors: Siva Satyendra Sahoo, B. Veeravalli, Akash Kumar
Venue: 2020 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Published: 2020-10-19
DOI: 10.1109/DFT50435.2020.9250892 (https://doi.org/10.1109/DFT50435.2020.9250892)
Citations: 3
Abstract
Continued transistor scaling and increasing power density have increased both transient and aging-related fault rates in silicon-based electronic systems. Traditional spatial redundancy-based fault-mitigation methods such as Triple Modular Redundancy (TMR) can lead to even higher power dissipation. In addition to accelerating the system’s rate of aging, such high power dissipation may be infeasible for resource-constrained embedded systems. Consequently, temporal redundancy-based methods are increasingly used to satisfy embedded applications’ reliability requirements. However, such methods result in stochastic execution times and hence introduce additional complexity into soft real-time system design. A simulation-based approach to finding quality of service (QoS)-aware optimal design points can lead to long design space exploration (DSE) times. To this end, we propose a Markov chain-based model for checkpointing with rollback recovery, a widely used temporal redundancy-based fault-mitigation method. Specifically, a task execution using checkpointing with intermediate validations is modeled as an absorbing Markov chain, and methods are presented for estimating the mean, variance, and probability distribution of the task’s resulting execution time. Further, we propose a multilevel design space pruning approach for determining the QoS-aware configuration of checkpointing. The presented modeling and estimation methods lead to considerable improvements in DSE time compared to a simulation-only approach.
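To make the absorbing-chain idea concrete, the following is a minimal Python sketch of a textbook absorbing Markov chain for a checkpointed task. It is not the paper's model. It assumes a deliberately simplified setting: the task is split into `n_segments` equal segments, each attempt at a segment passes validation independently with probability `p_ok`, a failed attempt rolls back to the last checkpoint and retries the same segment, and every attempt costs the same fixed `attempt_time`. All function and parameter names are hypothetical. The mean and variance come from the standard fundamental-matrix identities t = N·1 and Var = (2N − I)t − t², with N = (I − Q)⁻¹, and the distribution is obtained by propagating the state vector step by step.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def checkpoint_chain(n_segments, p_ok):
    """Transient block Q; state i means i segments are already committed.

    Assumption: a failed validation retries the same segment (self-loop),
    a passed validation advances by one. Passing the last segment leads
    to the absorbing "task done" state, which is not stored in Q.
    """
    q = [[0.0] * n_segments for _ in range(n_segments)]
    for i in range(n_segments):
        q[i][i] = 1.0 - p_ok
        if i + 1 < n_segments:
            q[i][i + 1] = p_ok
    return q


def steps_mean_var(q):
    """Mean and variance of the number of attempts until absorption,
    starting from state 0."""
    n = len(q)
    i_minus_q = [[(1.0 if r == c else 0.0) - q[r][c] for c in range(n)]
                 for r in range(n)]
    t = solve(i_minus_q, [1.0] * n)   # t = N 1
    nt = solve(i_minus_q, t)          # N t
    var = [2.0 * nt[i] - t[i] - t[i] ** 2 for i in range(n)]
    return t[0], var[0]


def steps_pmf(q, max_steps):
    """pmf[k] = P(absorbed at exactly attempt k + 1), truncated at max_steps."""
    n = len(q)
    state = [1.0] + [0.0] * (n - 1)
    pmf = []
    for _ in range(max_steps):
        nxt = [sum(state[r] * q[r][c] for r in range(n)) for c in range(n)]
        pmf.append(sum(state) - sum(nxt))  # probability mass absorbed this step
        state = nxt
    return pmf


def exec_time_stats(n_segments, p_ok, attempt_time):
    """Scale attempt counts to time, assuming a fixed cost per attempt."""
    mean_steps, var_steps = steps_mean_var(checkpoint_chain(n_segments, p_ok))
    return attempt_time * mean_steps, attempt_time ** 2 * var_steps
```

For example, `exec_time_stats(4, 0.9, 2.5)` returns the mean and variance of the execution time for four segments under these assumptions, and `steps_pmf(checkpoint_chain(4, 0.9), 200)` gives the truncated distribution of the attempt count. Because each call solves only a small linear system, candidate configurations can be compared analytically rather than by repeated simulation, which is the kind of saving the abstract attributes to model-based estimation. A library linear solver could replace the hand-written `solve`.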