CPRtree: A Tree-Based Checkpointing Architecture for Heterogeneous FPGA Computing

2016 Fourth International Symposium on Computing and Networking (CANDAR). Pub Date: 2016-11-01. DOI: 10.1109/CANDAR.2016.0024

H. Vu, S. Kajkamhaeng, Shinya Takamaeda-Yamazaki, Y. Nakashima

Citations: 8

Abstract

FPGAs provide reconfigurability and high performance for parallel applications. Modern FPGAs can be integrated into computing systems as accelerators that work with the host CPU to execute offloaded applications. This integration puts more pressure on the fault tolerance of computing systems, and the question of how to improve dependability becomes crucial. As in CPU-based systems, checkpoint/restart techniques are expected to be developed and applied to FPGA-based computing systems. Two issues arise in this situation: how to checkpoint and restart the FPGA, and how this checkpoint/restart model can work well with the checkpoint/restart model of the whole computing system. In this paper, we first propose a new checkpoint/restart architecture along with a checkpointing mechanism on the FPGA. Second, we propose "fine-grain" management of checkpointing to reduce performance degradation. Third, we propose a technique to capture consistent snapshots of the FPGA and the rest of the computing system. For host software, we also provide the CPRtree stack, including API functions to manage checkpoint/restart procedures on the FPGA. Our experimental results show that the checkpointing architecture causes up to 9.73% degradation in maximum clock frequency, small breakdown, and a small data footprint, while the LUT overhead varies from 17.98% (Dijkstra) to 160.67% (Matrix Multiplication).
