A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication

2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). Pub Date: 2023-04-01. DOI: 10.23919/DATE56975.2023.10136985

Sandeep Bal, Chandra Sekhar Mummidi, V. C. Ferreira, S. Srinivasan, S. Kundu



Abstract

General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced the Advanced Matrix Extensions (AMX®) instruction set for GEMM, which is supported by 1 kB AMX tile registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output without signaling an error. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly in hardware. Though the exact implementation of Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be detected concurrently at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) performing ABFT at the hardware level has no performance impact and incurs only a small area, latency, and power overhead.
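
The abstract does not spell out the checksum scheme, so the following is only a minimal software sketch of the classic row/column-checksum form of ABFT applied to a single tile product. It is not the paper's TMUL design. The function names (`matmul`, `row_sums`, `col_sums`, `locate_error`) are hypothetical, and integer tiles are assumed so that the checksum comparisons are exact.

```python
# Hypothetical software sketch of checksum-based ABFT for one tile product.
# Integer tiles keep the comparisons exact; this is not the paper's hardware.

def matmul(a, b):
    """Return the product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def row_sums(m):
    """Sum of each row of m."""
    return [sum(row) for row in m]


def col_sums(m):
    """Sum of each column of m."""
    return [sum(col) for col in zip(*m)]


def locate_error(a, b, c):
    """Check tile c against checksums derived from a and b.

    Returns None if c is consistent, or (row, col, delta) when exactly one
    element of c is off by delta. Raises ValueError for any other mismatch.
    """
    # Column sums of a correct C equal (column sums of A) x B.
    expected_cols = matmul([col_sums(a)], b)[0]
    # Row sums of a correct C equal A x (row sums of B).
    expected_rows = [row[0] for row in matmul(a, [[s] for s in row_sums(b)])]

    col_diff = [got - want for got, want in zip(col_sums(c), expected_cols)]
    row_diff = [got - want for got, want in zip(row_sums(c), expected_rows)]
    bad_cols = [j for j, d in enumerate(col_diff) if d != 0]
    bad_rows = [i for i, d in enumerate(row_diff) if d != 0]

    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One bad row and one bad column intersect at the corrupted element.
        return bad_rows[0], bad_cols[0], row_diff[bad_rows[0]]
    raise ValueError("checksum mismatch that a single-element fix cannot explain")
```

Given a result `(row, col, delta)`, subtracting `delta` from `c[row][col]` restores the tile. Because the check runs per tile, a fault can be flagged before the remaining tiles of the full product are computed, which is the tile-level detection advantage the abstract describes. With floating-point data the equality tests would need a tolerance, since checksums and products accumulate rounding error differently.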


