Mitigation of fail-stop failures in integer matrix products via numerical packing

Ijeoma Anarado, Y. Andreopoulos
Published in: 2015 IEEE 21st International On-Line Testing Symposium (IOLTS), 2015-07-06
DOI: 10.1109/IOLTS.2015.7229840 (https://doi.org/10.1109/IOLTS.2015.7229840)
Citations: 1

Abstract

The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running in such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem, based on producing redundant results within the numerical representation of the outputs via numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, the required duplication is inserted in the input matrices themselves. Accommodating the duplicated inputs imposes a 30.6% or 37.5% reduction in the maximum supported output bitwidth, in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to the one imposed by the checksum elements of traditional roll-forward methods, especially in cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only a 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3-37% overhead incurred by the equivalent checksum-based method.
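The idea of carrying a redundant result inside the numerical representation of the output can be illustrated with a small sketch. It is not the authors' implementation: the cyclic pairing of row blocks, the function names (`pack`, `unpack`, `packed_gemm_with_recovery`), the restriction to non-negative integers, and the choice of the shift `k` are all assumptions made for illustration. Each simulated core multiplies a packed block `own + neighbour * 2**k` by `B`; provided every true output entry is below `2**k`, the low bits hold the core's own result and the high bits hold a copy of its neighbour's result, which can be used if the neighbour fails.

```python
from typing import Dict, List, Set, Tuple

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product (stand-in for an optimized GEMM)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def pack(own: Matrix, neighbour: Matrix, k: int) -> Matrix:
    """Pack two equally sized non-negative blocks as own + neighbour * 2**k."""
    return [[o + (n << k) for o, n in zip(r_own, r_nb)]
            for r_own, r_nb in zip(own, neighbour)]


def unpack(packed: Matrix, k: int) -> Tuple[Matrix, Matrix]:
    """Split packed outputs into (low part, high part).

    Only valid when every true output entry is non-negative and < 2**k.
    """
    mask = (1 << k) - 1
    low = [[v & mask for v in row] for row in packed]
    high = [[v >> k for v in row] for row in packed]
    return low, high


def packed_gemm_with_recovery(a_blocks: List[Matrix], b: Matrix, k: int,
                              failed: Set[int]) -> List[Matrix]:
    """Simulate one core per row block of A; cores listed in `failed` return nothing."""
    p = len(a_blocks)
    outputs: Dict[int, Tuple[Matrix, Matrix]] = {}
    for i in range(p):
        if i in failed:
            continue  # fail-stop: this core produces no output at all
        # Hypothetical pairing: core i also carries the block of core (i + 1) mod p.
        packed = pack(a_blocks[i], a_blocks[(i + 1) % p], k)
        outputs[i] = unpack(matmul(packed, b), k)

    results: List[Matrix] = []
    for i in range(p):
        if i in outputs:
            results.append(outputs[i][0])      # core's own result (low bits)
        else:
            prev = (i - 1) % p
            if prev not in outputs:
                raise RuntimeError(f"block {i} is unrecoverable: core {prev} also failed")
            results.append(outputs[prev][1])   # redundant copy (high bits)
    return results


a_blocks = [[[1, 2]], [[3, 4]], [[5, 6]]]
b = [[1, 0], [2, 1]]
recovered = packed_gemm_with_recovery(a_blocks, b, k=8, failed={1})
```

In this toy run the block of the failed core 1 is read from the high bits of core 0's output, with no recomputation. The sketch also makes the trade-off stated in the abstract visible: the packed outputs need room for two results per word, so the bitwidth available to each individual result shrinks.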