Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2020-02-01. DOI: 10.1109/HPCA47549.2020.00014

Jingwen Leng, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Bose, Quan Chen, M. Guo, V. Reddi


Citations: 15

Abstract

Accelerators make it hard to build systems that are resilient against transient errors such as voltage noise and soft errors. Architects integrate accelerators into the system as black-box third-party IP components, so a fault in one or more accelerators may threaten the system's reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability sacrifice accelerator generality and efficiency and incur significant overhead, even in the absence of errors. To overcome these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and runtime support, to cope with transient errors. We introduce asymmetric resilience, which architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At runtime, the system exploits task-level idempotency to contain accelerator errors and uses memory protection instead of taking checkpoints to mitigate overheads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the forefront of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise and soft-error related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).
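
The recovery idea in the abstract, re-executing an idempotent task from protected inputs instead of restoring a checkpoint, can be illustrated with a minimal Python sketch. This is not the paper's implementation: `TransientError`, `run_idempotent`, `make_flaky_scale`, and the retry limit are hypothetical names and choices made for illustration, and a read-only mapping holding an immutable tuple merely stands in for hardware memory protection.

```python
from types import MappingProxyType


class TransientError(Exception):
    """Hypothetical signal that a transient fault was detected during a task."""


def run_idempotent(task, inputs, max_retries=3):
    """Run `task` and re-execute it from the same inputs if a fault is reported.

    Returns (result, number_of_attempts).
    """
    # Read-only view of the inputs: the task cannot rebind any input entry,
    # so every retry starts from the same state and no checkpoint is needed.
    protected = MappingProxyType(inputs)
    for attempt in range(1, max_retries + 1):
        try:
            return task(protected), attempt
        except TransientError:
            # Partial output is discarded; the inputs are intact, so retry.
            continue
    raise RuntimeError("task did not complete within max_retries")


def make_flaky_scale(fail_times):
    """Build a task that doubles its input and reports a fault `fail_times` times."""
    state = {"remaining": fail_times}

    def scale(inputs):
        out = [2 * x for x in inputs["data"]]
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientError("injected fault")
        return out

    return scale


# One injected fault: the first attempt is discarded, the second succeeds.
result, attempts = run_idempotent(make_flaky_scale(1), {"data": (1, 2, 3)})
print(result, attempts)  # prints: [2, 4, 6] 2
```

In this sketch the error-free path pays only for the read-only wrapper, while the cost of recovery (a full re-execution) is incurred only when a fault occurs, which mirrors the trade-off the abstract describes between recovery performance and error-free performance.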

