Algorithm Selection for Error Resilience in Scientific Computing

2014 IEEE 20th Pacific Rim International Symposium on Dependable Computing. Pub Date: 2014-11-18. DOI: 10.1109/PRDC.2014.20

Joseph Callenes-Sloan, H. McNamara


Citations: 1

Abstract

With process scaling and the adoption of post-CMOS technologies, reliability and power are becoming significant concerns for future computing systems, especially highly parallel systems. Previous approaches have investigated augmenting applications with additional logic to detect and correct errors efficiently. In this research, we investigate the impact of different algorithmic designs on error resilience and propose an approach to algorithm selection for a class of equations, namely partial differential equations (PDEs), that are at the core of many of the scientific computing applications that drive HPC systems. Many different schemes have been devised for the approximation of PDE systems, each with different accuracy, stability, and performance properties. We address two primary questions: (1) Does numerical stability translate to error resilience? and (2) How do we design schemes to improve error resilience? If an algorithm's error resilience is correlated with its numerical stability properties, this may allow us to design more resilient applications by leveraging well-established information on numerical stability. Even with a clear translation of numerical stability to error resilience properties, however, the question of how to design these algorithms remains, because of the variety of implementations and schemes and the largely input-specific nature of the design. We propose one approach for automated design using machine learning. We observe that intelligent selection of the algorithm for a given problem improves robustness by 20%-50%, on average, over the traditional selection of algorithms, without the addition of any other detection/correction logic.
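Question (1) in the abstract can be made concrete with a small experiment. The following is a minimal sketch, not the authors' method or benchmark: it assumes a 1D heat equation with zero Dirichlet boundaries as the model PDE, a single additive perturbation of one grid value as the fault model, and hypothetical helper names (`step_explicit`, `step_implicit`, `run`, `deviation`). It compares how far a faulty run drifts from a fault-free run under an explicit scheme and an implicit scheme, where `r` stands for dt/dx^2 with unit diffusivity.

```python
import math


def step_explicit(u, r):
    """One forward-time, centered-space (FTCS) step; stable only for r <= 0.5."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def step_implicit(u, r):
    """One backward-Euler step, solved with the Thomas algorithm; stable for any r > 0."""
    n = len(u) - 2  # number of interior unknowns
    # Interior system: -r*x[i-1] + (1 + 2r)*x[i] - r*x[i+1] = u[i]
    rhs = u[1:-1]
    rhs[0] += r * u[0]    # fold the known boundary values into the right-hand side
    rhs[-1] += r * u[-1]
    diag = 1.0 + 2.0 * r
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = -r / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):  # forward elimination
        denom = diag + r * c[i - 1]
        c[i] = -r / denom
        d[i] = (rhs[i] + r * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return [u[0]] + x + [u[-1]]


def run(step, r, steps=60, n=21, fault_step=None, delta=1e-3):
    """Advance u(x, 0) = sin(pi x); optionally add delta to the midpoint once."""
    u = [math.sin(math.pi * i / (n - 1)) for i in range(n)]
    u[0] = u[-1] = 0.0  # zero Dirichlet boundaries (assumed)
    for k in range(steps):
        if k == fault_step:
            u[n // 2] += delta  # assumed fault model: one additive perturbation
        u = step(u, r)
    return u


def deviation(step, r, delta=1e-3):
    """Largest pointwise gap between the faulty and fault-free final states."""
    clean = run(step, r)
    faulty = run(step, r, fault_step=10, delta=delta)
    return max(abs(a - b) for a, b in zip(clean, faulty))


if __name__ == "__main__":
    for name, step, r in [("explicit", step_explicit, 0.4),
                          ("explicit", step_explicit, 0.6),
                          ("implicit", step_implicit, 0.6)]:
        print(name, "r =", r, "deviation =", deviation(step, r))
```

In this toy setting the injected perturbation decays under both schemes when `r` is within the explicit stability limit, and it is amplified by the explicit scheme once `r` exceeds 0.5, while the implicit scheme continues to damp it. That is one simple case where numerical stability and tolerance of a transient error line up. It says nothing about the paper's fault model, its set of schemes, or its machine-learning selector.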


