{"title":"Mitigation of fail-stop failures in integer matrix products via numerical packing","authors":"Ijeoma Anarado, Y. Andreopoulos","doi":"10.1109/IOLTS.2015.7229840","DOIUrl":null,"url":null,"abstract":"The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running on such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem that is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions that require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, in our proposal the required duplication is inserted in the input matrices themselves. The accommodation of the duplicated inputs imposes 30.6% or 37.5% reduction in the maximum output bitwidth supported in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to the one imposed due to the checksum elements of traditional roll-forward methods, especially for cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3 - 37% overhead incurred by the equivalent checksum-based method.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IOLTS.2015.7229840","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running in such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem that is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, in our proposal the required duplication is inserted in the input matrices themselves. Accommodating the duplicated inputs imposes a 30.6% or 37.5% reduction in the maximum supported output bitwidth in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to that imposed by the checksum elements of traditional roll-forward methods, especially for cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only a 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3-37% overhead incurred by the equivalent checksum-based method.
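The sketch below is a minimal toy illustration of the numerical-packing principle described in the abstract, not the authors' actual construction: each input entry is duplicated inside its own integer word as a -> a*(1 + 2^K), so a single integer GEMM yields two copies of every output entry in disjoint bit fields of one result word. The choice of K, the NumPy realization, and the restriction to non-negative values are illustrative assumptions; the offset K bounds the representable output magnitude, which is the bitwidth-for-redundancy trade-off the abstract quantifies.

```python
import numpy as np

# Assumed packing offset in bits; every output must satisfy 0 <= c < 2**K.
K = 24

def pack_inputs(A):
    """Duplicate each entry of A within one 64-bit word: a -> a + (a << K)."""
    A = A.astype(np.int64)
    return A + (A << K)

def unpack_outputs(Cp):
    """Extract the two redundant copies of each output from the packed result."""
    mask = (np.int64(1) << K) - 1
    copy_low = Cp & mask      # output value held in the low-order bits
    copy_high = Cp >> K       # same value repeated in the higher-order bits
    return copy_low, copy_high

rng = np.random.default_rng(0)
A = rng.integers(0, 1 << 8, size=(4, 16))   # small non-negative inputs so that
B = rng.integers(0, 1 << 8, size=(16, 4))   # all outputs stay below 2**K

Cp = pack_inputs(A) @ B.astype(np.int64)    # one packed integer GEMM
c1, c2 = unpack_outputs(Cp)

assert np.array_equal(c1, A.astype(np.int64) @ B.astype(np.int64))
assert np.array_equal(c1, c2)               # redundant copy agrees with the output
```

In the paper's setting, such duplicated copies are arranged across the workload so that results lost to a fail-stop core can be recovered from the packed outputs of the surviving cores; the toy above only demonstrates why carrying the duplicate inside the numerical representation costs output bitwidth rather than extra hardware.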