{"title":"Mitigation of fail-stop failures in integer matrix products via numerical packing","authors":"Ijeoma Anarado, Y. Andreopoulos","doi":"10.1109/IOLTS.2015.7229840","DOIUrl":null,"url":null,"abstract":"The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running on such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem that is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions that require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, in our proposal the required duplication is inserted in the input matrices themselves. The accommodation of the duplicated inputs imposes 30.6% or 37.5% reduction in the maximum output bitwidth supported in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to the one imposed due to the checksum elements of traditional roll-forward methods, especially for cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3 - 37% overhead incurred by the equivalent checksum-based method.","PeriodicalId":413023,"journal":{"name":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 21st International On-Line Testing Symposium (IOLTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IOLTS.2015.7229840","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
The decreasing mean-time-to-failure estimates of distributed computing systems indicate that high-performance generic matrix multiply (GEMM) routines running in such environments may need to mitigate an increasing number of fail-stop failures. We propose a new roll-forward solution to this problem that is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. In particular, unlike all existing approaches, the proposed approach does not require additional hardware resources for failure mitigation. Instead, in our proposal the required duplication is inserted in the input matrices themselves. Accommodating the duplicated inputs imposes a 30.6% or 37.5% reduction in the maximum supported output bitwidth in comparison to integer matrix products performed on 32-bit floating-point or integer representations, respectively. Nevertheless, this bitwidth reduction is comparable to that imposed by the checksum elements of traditional roll-forward methods, especially for cases where multiple core failures must be mitigated. Experiments performed on an Amazon EC2 instance with 6 Intel Haswell cores dedicated to GEMM computations show that, in comparison to the state-of-the-art failure-intolerant integer GEMM realization, the proposed approach incurs only a 5-19.4% drop in the achievable peak performance. This overhead is significantly lower than the 33.3-37% overhead incurred by the equivalent checksum-based method.
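The sketch below is a minimal toy illustration of the numerical-packing principle described in the abstract, not the authors' actual construction: each input entry is duplicated inside its own integer word as a -> a*(1 + 2^K), so a single integer GEMM yields two copies of every output entry in disjoint bit fields of one result word. The choice of K, the NumPy realization, and the restriction to non-negative values are illustrative assumptions; the offset K bounds the representable output magnitude, which is the bitwidth-for-redundancy trade-off the abstract quantifies.

```python
import numpy as np

# Assumed packing offset in bits; every output must satisfy 0 <= c < 2**K.
K = 24

def pack_inputs(A):
    """Duplicate each entry of A within one 64-bit word: a -> a + (a << K)."""
    A = A.astype(np.int64)
    return A + (A << K)

def unpack_outputs(Cp):
    """Extract the two redundant copies of each output from the packed result."""
    mask = (np.int64(1) << K) - 1
    copy_low = Cp & mask      # output value held in the low-order bits
    copy_high = Cp >> K       # same value repeated in the higher-order bits
    return copy_low, copy_high

rng = np.random.default_rng(0)
A = rng.integers(0, 1 << 8, size=(4, 16))   # small non-negative inputs so that
B = rng.integers(0, 1 << 8, size=(16, 4))   # all outputs stay below 2**K

Cp = pack_inputs(A) @ B.astype(np.int64)    # one packed integer GEMM
c1, c2 = unpack_outputs(Cp)

assert np.array_equal(c1, A.astype(np.int64) @ B.astype(np.int64))
assert np.array_equal(c1, c2)               # redundant copy agrees with the output
```

In the paper's setting, such duplicated copies are arranged across the workload so that results lost to a fail-stop core can be recovered from the packed outputs of the surviving cores; the toy above only demonstrates why carrying the duplicate inside the numerical representation costs output bitwidth rather than extra hardware.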