CPRtree: A Tree-Based Checkpointing Architecture for Heterogeneous FPGA Computing
H. Vu, S. Kajkamhaeng, Shinya Takamaeda-Yamazaki, Y. Nakashima
2016 Fourth International Symposium on Computing and Networking (CANDAR), published 2016-11-01
DOI: 10.1109/CANDAR.2016.0024 (https://doi.org/10.1109/CANDAR.2016.0024)
Citations: 8
Abstract
FPGAs provide reconfigurability and high performance for parallel applications. Modern FPGAs can be integrated into computing systems as accelerators, working together with the host CPU to execute offloaded applications. This integration puts more pressure on the fault tolerance of computing systems, and the question of how to improve dependability becomes crucial. As in CPU-based systems, checkpoint/restart techniques are expected to be developed and applied to FPGA-based computing systems. Two issues arise in this situation: how to checkpoint and restart the FPGA, and how to make this checkpoint/restart model work well with the checkpoint/restart model of the whole computing system. In this paper, we first propose a new checkpoint/restart architecture along with a checkpointing mechanism on the FPGA. Second, we propose "fine-grain" management of checkpointing to reduce performance degradation. Third, we propose a technique to capture consistent snapshots of the FPGA and the rest of the computing system. For host software, we also provide the CPRtree stack, including API functions to manage checkpoint/restart procedures on the FPGA. Our experimental results show that the checkpointing architecture causes a maximum clock frequency degradation of up to 9.73%, a small breakdown, and a small data footprint, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
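The idea of a consistent snapshot across the accelerator and the rest of the system can be illustrated with a toy software model. The sketch below is not the CPRtree API or architecture, which the abstract does not detail. `ToyAccelerator`, `ToyHost`, `checkpoint`, `restart`, and `run` are hypothetical names, and the "accelerator" is a plain Python object that sums a list. It only shows the general pattern, under those assumptions: quiesce the accelerator, capture both states together, resume, and later roll both sides back to the same point.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyAccelerator:
    """Hypothetical stand-in for an FPGA kernel: sums its input one element per step."""
    data: list
    index: int = 0
    acc: int = 0
    paused: bool = False

    def step(self):
        # Advance by one element unless paused or already finished.
        if not self.paused and self.index < len(self.data):
            self.acc += self.data[self.index]
            self.index += 1

    def done(self):
        return self.index >= len(self.data)

    def capture(self):
        # Only the state needed to resume the computation is saved.
        return {"index": self.index, "acc": self.acc}

    def restore(self, state):
        self.index = state["index"]
        self.acc = state["acc"]


@dataclass
class ToyHost:
    """Hypothetical host-side state: counts the steps it has issued."""
    steps_issued: int = 0


def checkpoint(host, accel):
    """Pause the accelerator, snapshot both sides at the same point, then resume."""
    accel.paused = True
    snapshot = {"host": copy.deepcopy(host.__dict__), "accel": accel.capture()}
    accel.paused = False
    return snapshot


def restart(host, accel, snapshot):
    """Roll host and accelerator back to the same saved point."""
    host.__dict__.update(copy.deepcopy(snapshot["host"]))
    accel.restore(snapshot["accel"])


def run(host, accel, max_steps=None):
    """Drive the accelerator from the host for at most max_steps steps."""
    issued = 0
    while not accel.done() and (max_steps is None or issued < max_steps):
        accel.step()
        host.steps_issued += 1
        issued += 1
```

In this model, a checkpoint taken after two steps can be restored after a simulated fault, and the run then completes with the same result as an uninterrupted run. Pausing before capture is what keeps the two halves of the snapshot consistent; a real FPGA design would also have to handle in-flight transfers, which this sketch ignores.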