Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators

IF 1.8 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Architecture and Code Optimization Pub Date : 2023-11-22 DOI:10.1145/3633332

Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu

{"title":"Highly Efficient Self-Checking Matrix Multiplication on Tiled AMX Accelerators","authors":"Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu","doi":"10.1145/3633332","DOIUrl":null,"url":null,"abstract":"<p>General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine-learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors necessitating online error detection. Algorithm-based Error Detection techniques (ABED) is a powerful technique to detect errors in matrix multiplications. In this paper, we consider implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previously an error bound was considered for addressing rounding errors in ABED. If the error detection threshold is set too low, it will trigger false alarm while a loose bound will allow errors to escape detection. In this paper, we propose an adaptive error threshold that takes into account the TMUL input values to address the problem of false triggers and error escapes, and provide a taxonomy of various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware. Consequently, we relax the threshold such that it can be easily computed in hardware. While ABED ensures error free computation it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage for all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault injection experiments and show that our approach does not produce any false alarms or detection escapes for observable errors. We conducted additional fault injection experiments on a Deep Neural Network (DNN) model and find that if a fault is not detected, it does not cause any misclassification.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"14 2 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3633332","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine-learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors necessitating online error detection. Algorithm-based Error Detection techniques (ABED) is a powerful technique to detect errors in matrix multiplications. In this paper, we consider implementation of ABED that integrates seamlessly with the TMUL hardware to minimize performance overhead. Unfortunately, rounding errors introduced by floating-point operations do not allow a straightforward implementation of ABED in TMUL. Previously an error bound was considered for addressing rounding errors in ABED. If the error detection threshold is set too low, it will trigger false alarm while a loose bound will allow errors to escape detection. In this paper, we propose an adaptive error threshold that takes into account the TMUL input values to address the problem of false triggers and error escapes, and provide a taxonomy of various error classes. This threshold is obtained from theoretical error analysis but is not easy to implement in hardware. Consequently, we relax the threshold such that it can be easily computed in hardware. While ABED ensures error free computation it does not guarantee full coverage of all hardware faults. To address this problem, we propose an algorithmic pattern-generation technique to ensure full coverage for all hardware faults. To evaluate the benefits of our proposed solution, we conducted fault injection experiments and show that our approach does not produce any false alarms or detection escapes for observable errors. We conducted additional fault injection experiments on a Deep Neural Network (DNN) model and find that if a fault is not detected, it does not cause any misclassification.

查看原文本刊更多论文

在平铺AMX加速器上的高效自检矩阵乘法

通用矩阵乘法(GEMM)是一种计算成本很高的操作，用于许多应用程序，如机器学习。硬件加速器在加速GEMM计算方面越来越受欢迎，最近的英特尔处理器中的平铺矩阵乘法(TMUL)就是一个例子。不幸的是，TMUL硬件容易出错，因此需要进行在线错误检测。基于算法的错误检测技术(ABED)是一种检测矩阵乘法错误的强大技术。在本文中，我们考虑与TMUL硬件无缝集成的ABED实现，以最小化性能开销。不幸的是，浮点运算引入的舍入误差不允许在TMUL中直接实现ABED。以前，为了解决ABED中的舍入错误，考虑了一个错误边界。如果错误检测阈值设置过低，将触发虚警，而松散的绑定将允许错误逃避检测。在本文中，我们提出了一个考虑TMUL输入值的自适应错误阈值，以解决错误触发和错误转义的问题，并提供了各种错误类的分类。该阈值由理论误差分析得到，但在硬件上不易实现。因此，我们放宽阈值，使其可以在硬件中轻松计算。虽然ABED确保无错误计算，但它不能保证完全覆盖所有硬件故障。为了解决这个问题，我们提出了一种算法模式生成技术，以确保完全覆盖所有硬件故障。为了评估我们提出的解决方案的好处，我们进行了故障注入实验，并表明我们的方法不会对可观察到的错误产生任何假警报或检测逃逸。我们在深度神经网络(DNN)模型上进行了额外的故障注入实验，发现如果未检测到故障，则不会导致任何错误分类。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ACM Transactions on Architecture and Code Optimization 工程技术-计算机：理论方法

CiteScore

3.60

自引率

6.20%

发文量

审稿时长

6-12 weeks

期刊介绍： ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.