Hardening Strategies for HPC Applications

Daniel Oliveira, P. Rech, P. Navaux
{"title":"Hardening Strategies for HPC Applications","authors":"Daniel Oliveira, P. Rech, P. Navaux","doi":"10.5753/WSCAD_ESTENDIDO.2019.8708","DOIUrl":null,"url":null,"abstract":"HPC devices reliability is one of the major concerns for supercomputers today and for the next generation. In fact, the high number of devices in large data centers makes the probability of having at least a device corrupted to be very high. In this work, we first evaluate the problem by performing radiation experiments. The data from the experiments give us realistic error rate of HPC devices. Moreover, we evaluate a representative set of algorithms deriving general insights of parallel algorithms and programming approaches reliability. To understand better the problem, we propose a novel methodology to go beyond the quantification of the problem. We qualify the error by evaluating the criticality of each corrupted execution through a dedicated set of metrics. We show that, as long as imprecise computing is concerned, the simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on applications output correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate radiation-induced error magnitude. Furthermore, we designed a homemade fault-injector, CAROL-FI, to understand further the problem by collecting information using fault injection campaigns that is not possible through radiation experiments. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions. This work also evaluates the reliability behaviors of six different architectures, ranging from HPC devices to embedded ones, with the aim to isolate code- and architecture-dependent behaviors. For this evaluation, we present and discuss radiation experiments that cover a total of more than 352,000 years of natural exposure and fault-injection analysis based on a total of more than 120,000 injections. Finally, Error-Correcting Code, Algorithm-Based Fault Tolerance, and Duplication With Comparison hardening strategies are presented and evaluated on HPC devices through radiation experiments. We present and compare both the reliability improvement and imposed overhead of the selected hardening solutions. Then, we propose and analyze the impact of selective hardening for HPC algorithms. 
We perform fault-injection campaigns to identify the most critical source code variables and present how to select the best candidates to maximize the reliability/overhead ratio.","PeriodicalId":280012,"journal":{"name":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/WSCAD_ESTENDIDO.2019.8708","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The reliability of HPC devices is one of the major concerns for supercomputers today and for the next generation. In fact, the large number of devices in big data centers makes the probability of having at least one corrupted device very high. In this work, we first evaluate the problem by performing radiation experiments. The data from these experiments give us realistic error rates for HPC devices. Moreover, we evaluate a representative set of algorithms, deriving general insights into the reliability of parallel algorithms and programming approaches.

To understand the problem better, we propose a novel methodology that goes beyond quantification: we qualify the error by evaluating the criticality of each corrupted execution through a dedicated set of metrics. We show that, as far as imprecise computing is concerned, simple mismatch detection is not sufficient to evaluate and compare the radiation sensitivity of HPC devices and algorithms. Our analysis quantifies and qualifies radiation effects on application output, correlating the number of corrupted elements with their spatial locality. We also provide the mean relative error (dataset-wise) to evaluate the magnitude of radiation-induced errors.

Furthermore, we designed a custom fault injector, CAROL-FI, to understand the problem further by collecting, through fault-injection campaigns, information that cannot be obtained from radiation experiments. We inject different fault models to analyze the sensitivity of given applications. We show that portions of an application can be graded by different criticalities; mitigation techniques can then be relaxed or hardened based on the criticality of each portion.

This work also evaluates the reliability behavior of six different architectures, ranging from HPC devices to embedded ones, with the aim of isolating code- and architecture-dependent behaviors. For this evaluation, we present and discuss radiation experiments covering a total of more than 352,000 years of natural exposure, and fault-injection analyses based on a total of more than 120,000 injections.

Finally, the Error-Correcting Code (ECC), Algorithm-Based Fault Tolerance (ABFT), and Duplication With Comparison (DWC) hardening strategies are presented and evaluated on HPC devices through radiation experiments. We present and compare both the reliability improvement and the overhead imposed by the selected hardening solutions. We then propose and analyze the impact of selective hardening for HPC algorithms: we perform fault-injection campaigns to identify the most critical source-code variables and show how to select the best candidates to maximize the reliability/overhead ratio.
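As a point of reference for the metrics above, the mean relative error can be formalized as below. This is a standard element-wise definition and only a plausible reading of "mean relative error (dataset-wise)", with $g_i$ the elements of the golden (fault-free) output and $o_i$ the corresponding elements of the corrupted output:

$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert o_i - g_i \rvert}{\lvert g_i \rvert}$$

where $N$ is the number of elements in the output dataset.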
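To illustrate the kind of fault model a campaign like this injects, the sketch below emulates a single-event upset by flipping one randomly chosen bit of a floating-point variable. This is a generic illustration only, not CAROL-FI's actual interface:

```cpp
#include <cstdint>
#include <cstring>
#include <random>

// Emulate a single-bit upset in a double: flip one randomly chosen bit.
// Generic bit-flip fault model for illustration; not CAROL-FI's API.
double inject_bitflip(double value, std::mt19937 &rng) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);       // reinterpret the raw bits
    std::uniform_int_distribution<int> pick(0, 63);
    bits ^= std::uint64_t{1} << pick(rng);         // flip one of the 64 bits
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```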
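Algorithm-Based Fault Tolerance attaches checksums to the operands so that the result can be verified (and, in the full scheme, corrected) algebraically. The following is a minimal detection-only sketch for a matrix-vector product; the function name and tolerance are illustrative, not the evaluated implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// ABFT detection sketch for y = A*x: the column sums of A, dotted with x,
// must equal the sum of y's elements. A mismatch beyond the floating-point
// tolerance signals a fault during the computation. Detection only; full
// ABFT schemes can also locate and correct single errors.
bool abft_matvec(const std::vector<std::vector<double>> &A,
                 const std::vector<double> &x,
                 std::vector<double> &y, double tol = 1e-9) {
    const std::size_t n = A.size(), m = x.size();
    std::vector<double> colsum(m, 0.0);
    y.assign(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < m; ++j) {
            y[i] += A[i][j] * x[j];
            colsum[j] += A[i][j];
        }
    double check = 0.0, total = 0.0;
    for (std::size_t j = 0; j < m; ++j) check += colsum[j] * x[j];
    for (std::size_t i = 0; i < n; ++i) total += y[i];
    return std::fabs(check - total) <= tol * (1.0 + std::fabs(check));
}
```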
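Duplication With Comparison is the simplest of the three strategies: execute the computation twice and compare the outputs, at the cost of roughly doubling the execution time. A minimal sketch with a hypothetical helper:

```cpp
#include <functional>

// DWC sketch: run the computation twice and compare. A mismatch means a
// transient fault corrupted at least one execution; the caller can then
// re-execute or abort. The price is roughly 2x execution time.
template <typename T>
bool run_with_dwc(const std::function<T()> &kernel, T &out) {
    T first = kernel();
    T second = kernel();
    out = first;
    return first == second;   // false: fault detected
}
```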
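Selective hardening, as studied in the last paragraph of the abstract, protects only the variables that the fault-injection campaigns rank as most critical. One common way to protect an individual variable is triplication with majority voting; the sketch below is a generic illustration of that idea, not the authors' specific mechanism:

```cpp
// Majority vote over three replicas of a critical variable (TMR).
// Triplicating only the few most critical variables keeps the overhead
// low while targeting the reliability gain where it matters most.
template <typename T>
T tmr_vote(const T &a, const T &b, const T &c) {
    return (a == b || a == c) ? a : b;   // any two agreeing replicas win
}
```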