En-Chung Yang, Keheng Huang, Yu Hu, Xiaowei Li, Jian Gong, Hongjin Liu, Bo Liu
HHC: Hierarchical hardware checkpointing to accelerate fault recovery for SRAM-based FPGAs
2013 IEEE 19th International On-Line Testing Symposium (IOLTS), published 2013-07-08
DOI: 10.1109/IOLTS.2013.6604078 (https://doi.org/10.1109/IOLTS.2013.6604078)
Citations: 3
Abstract
As feature sizes shrink to the nanometer scale, SRAM-based FPGAs are increasingly vulnerable to soft errors. Checkpointing is an effective fault recovery technique that restores a faulty system to its previous fault-free state. Because the system's function must be suspended while a checkpoint is saved or restored, the Mean Time to Repair (MTTR) is critical to system performance. In this work, we propose a hierarchical hardware checkpointing (HHC) technique that combines a high-speed on-chip checkpoint with a low-speed off-chip checkpoint to accelerate fault recovery for SRAM-based FPGAs. Most single event effect (SEE) faults can be recovered from the high-speed on-chip checkpoint, which significantly reduces the MTTR of the system. The memory resource occupation of the on-chip checkpoint is low because HHC stores only the logic states of user bits and check information for configuration bits. Experimental results show that, compared with traditional off-chip checkpoint strategies, the proposed technique can reduce the MTTR of the system by 94.30%. In addition, HHC occupies 11.11% of the FPGA's memory resources, which is somewhat high but can be further optimized.
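A simple expected-value model helps show why a fast first-level checkpoint lowers MTTR even though a slow fallback is still needed. The sketch below is illustrative only and is not taken from the paper: the `CheckpointLevel` type, the `expected_mttr` function, the restore times, and the 75% on-chip coverage are all hypothetical values chosen for easy arithmetic. It assumes recovery tries each level in order and falls through to the next level only when the current one cannot repair the fault.

```python
from dataclasses import dataclass


@dataclass
class CheckpointLevel:
    name: str
    restore_time_us: float  # hypothetical restore latency, in microseconds
    coverage: float         # fraction of faults this level repairs once it is tried


def expected_mttr(levels):
    """Expected repair time when levels are tried in order until one succeeds."""
    expected = 0.0
    p_reach = 1.0  # probability that recovery gets as far as this level
    for level in levels:
        expected += p_reach * level.restore_time_us
        p_reach *= 1.0 - level.coverage
    return expected


# Hypothetical numbers, NOT measurements from the paper.
on_chip = CheckpointLevel("on-chip", restore_time_us=20.0, coverage=0.75)
off_chip = CheckpointLevel("off-chip", restore_time_us=1000.0, coverage=1.0)

hierarchical = expected_mttr([on_chip, off_chip])  # 20 + 0.25 * 1000
off_chip_only = expected_mttr([off_chip])
reduction = 1.0 - hierarchical / off_chip_only

print(f"hierarchical MTTR:  {hierarchical:.1f} us")
print(f"off-chip-only MTTR: {off_chip_only:.1f} us")
print(f"reduction:          {reduction:.1%}")
```

In this model the saving grows with the fraction of faults the on-chip level can handle, which is consistent with the abstract's point that most SEE faults are recovered on-chip.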