SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems

Diego Montezanti
{"title":"SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems","authors":"Diego Montezanti","doi":"10.24215/16666038.20.e14","DOIUrl":null,"url":null,"abstract":"  \nReliability and fault tolerance have become aspects of growing relevance in the field of HPC, due to the increased probability that faults of different kinds will occur in these systems. This is fundamentally due to the increasing complexity of the processors, in the search to improve performance, which leads to a rise in the scale of integration and in the number of components that work near their technological limits, being increasingly prone to failures. Another factor that affects is the growth in the size of parallel systems to obtain greater computational power, in terms of number of cores and processing nodes. \nAs applications demand longer uninterrupted computation times, the impact of faults grows, due to the cost of relaunching an execution that was aborted due to the occurrence of a fault or concluded with erroneous results. Consequently, it is necessary to run these applications on highly available and reliable systems, requiring strategies capable of providing detection, protection and recovery against faults. \nIn the next years it is planned to reach Exa-scale, in which there will be supercomputers with millions of processing cores, capable of performing on the order of 1018 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that they will not complete their executions. Recent studies show that, as systems continue to include more processors, the Mean Time Between Errors decreases, resulting in higher failure rates and increased risk of corrupted results; large parallel applications are expected to deal with errors that occur every few minutes, requiring external help to progress efficiently. Silent Data Corruptions are the most dangerous errors that can occur, since they can generate incorrect results in programs that appear to execute correctly. Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge towards resilience in HPC. In message passing applications, a silent error, affecting a single task, can produce a pattern of corruption that spreads to all communicating processes; in the worst case scenario, the erroneous final results cannot be detected at the end of the execution and will be taken as correct. \nSince scientific applications have execution times of the order of hours or even days, it is essential to find strategies that allow applications to reach correct solutions in a bounded time, despite the underlying failures. These strategies also prevent energy consumption from skyrocketing, since if they are not used, the executions should be launched again from the beginning. However, the most popular parallel programming models used in supercomputers lack support for fault tolerance.","PeriodicalId":188846,"journal":{"name":"J. Comput. Sci. Technol.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Comput. Sci. 
Technol.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.24215/16666038.20.e14","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reliability and fault tolerance have become increasingly relevant in the field of HPC, due to the growing probability that faults of different kinds will occur in these systems. This is fundamentally a consequence of the increasing complexity of processors in the search for better performance, which leads to a higher scale of integration and a larger number of components working near their technological limits, making them increasingly prone to failure. Another contributing factor is the growth in the size of parallel systems, in terms of the number of cores and processing nodes, to obtain greater computational power.

As applications demand longer uninterrupted computation times, the impact of faults grows, due to the cost of relaunching an execution that was aborted by a fault or that concluded with erroneous results. Consequently, it is necessary to run these applications on highly available and reliable systems, which requires strategies capable of providing detection, protection, and recovery against faults.

In the coming years, computing is expected to reach exascale: supercomputers with millions of processing cores, capable of performing on the order of 10^18 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that they will not complete their executions. Recent studies show that, as systems continue to include more processors, the Mean Time Between Errors (MTBE) decreases, resulting in higher failure rates and a greater risk of corrupted results; large parallel applications are expected to deal with errors that occur every few minutes, requiring external help to make progress efficiently. Silent Data Corruptions (SDCs) are the most dangerous errors that can occur, since they can produce incorrect results in programs that appear to execute correctly. Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge on the road to resilience in HPC. In message-passing applications, a silent error affecting a single task can produce a corruption pattern that spreads to all communicating processes; in the worst case, the erroneous final results cannot be detected at the end of the execution and are taken as correct.

Since scientific applications have execution times on the order of hours or even days, it is essential to find strategies that allow applications to reach correct solutions in bounded time despite the underlying faults. Such strategies also prevent energy consumption from skyrocketing: without them, executions would have to be relaunched from the beginning. However, the most popular parallel programming models used in supercomputers lack support for fault tolerance.
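The abstract does not quantify how the MTBE shrinks with scale. Under the common simplifying assumption that node errors are independent and exponentially distributed (an assumption of this note, not a claim of the paper), the system-level MTBE falls inversely with the number of nodes N:

\[
\mathrm{MTBE}_{\mathrm{system}} = \frac{\mathrm{MTBE}_{\mathrm{node}}}{N}
\]

For example, nodes with an individual MTBE of 10 years (about 5,259,600 minutes) give a 100,000-node system an MTBE of roughly 5,259,600 / 100,000 ≈ 53 minutes, consistent with the expectation above that large parallel applications must deal with errors every few minutes to hours.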
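The abstract does not describe SEDAR's detection mechanism, so the following is only an illustrative sketch of duplication with comparison, a standard way to detect an SDC: the same computation is executed twice and the replicas are compared before the result is allowed to propagate (for example, before a message is sent to other processes). The kernel compute_step and the rollback comment are hypothetical, not taken from the paper.

#include <stdio.h>
#include <string.h>

/* Hypothetical computation kernel standing in for one step of the
 * application (e.g., the work done between two message exchanges). */
static double compute_step(const double *data, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += data[i] * data[i];
    return acc;
}

int main(void) {
    double data[4] = {1.0, 2.0, 3.0, 4.0};

    /* Duplicated execution: run the same step twice on the same input. */
    double r1 = compute_step(data, 4);
    double r2 = compute_step(data, 4);

    /* Compare the replicas bitwise before the result propagates.
     * A mismatch signals a silent error in one of the replicas. */
    if (memcmp(&r1, &r2, sizeof r1) != 0) {
        fprintf(stderr, "silent error detected: replicas diverge\n");
        return 1; /* in a full system: restore a checkpoint and re-execute */
    }
    printf("replicas agree: %f\n", r1);
    return 0;
}

In a complete fault-tolerance scheme, a detected mismatch would trigger recovery, typically a rollback to the most recent checkpoint rather than an abort, so that the execution still reaches a correct result in bounded time.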