Sandeep Bal, Chandra sekhar Mummidi, V. C. Ferreira, S. Srinivasan, S. Kundu
{"title":"一种新的平铺矩阵乘法容错体系结构","authors":"Sandeep Bal, Chandra sekhar Mummidi, V. C. Ferreira, S. Srinivasan, S. Kundu","doi":"10.23919/DATE56975.2023.10136985","DOIUrl":null,"url":null,"abstract":"General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced an Advanced Matrix Multiplication (AMX®) instruction set for GEMM, which is supported by 1kB AMX registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly on hardware. Though the exact implementation of Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication\",\"authors\":\"Sandeep Bal, Chandra sekhar Mummidi, V. C. Ferreira, S. Srinivasan, S. Kundu\",\"doi\":\"10.23919/DATE56975.2023.10136985\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced an Advanced Matrix Multiplication (AMX®) instruction set for GEMM, which is supported by 1kB AMX registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly on hardware. Though the exact implementation of Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead.\",\"PeriodicalId\":340349,\"journal\":{\"name\":\"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/DATE56975.2023.10136985\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/DATE56975.2023.10136985","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication
General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced an Advanced Matrix Multiplication (AMX®) instruction set for GEMM, which is supported by 1kB AMX registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly on hardware. Though the exact implementation of Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead.