ConCeal: A Winograd convolution code template for optimising GCU in parallel

Tian Chen, Yu-an Tan, Thar Baker, Haokai Wu, Qiuyu Zhang, Yuanzhang Li

Journal of Parallel and Distributed Computing, Volume 203, Article 105108 (published 2025-05-21)
DOI: 10.1016/j.jpdc.2025.105108
URL: https://www.sciencedirect.com/science/article/pii/S0743731525000759
Citations: 0
Abstract
By minimising arithmetic operations, Winograd convolution substantially reduces the computational complexity of convolution, a pivotal operation in the training and inference stages of Convolutional Neural Networks (CNNs). This study leverages the hardware architecture and capabilities of Shanghai Enflame Technology's AI accelerator, the General Computing Unit (GCU). We develop a code template named ConCeal for Winograd convolution with 3 × 3 kernels, employing a set of interrelated optimisations, including task partitioning, memory layout design, and parallelism. These optimisations fully exploit the GCU's computing resources by optimising dataflow and parallelising task execution across GCU cores, thereby accelerating Winograd convolution. Moreover, the integrated optimisations in the template apply efficiently to other operators, such as max pooling. Using this template, we implement and assess the performance of four Winograd convolution operators on the GCU. The experimental results show that ConCeal operators achieve a maximum speedup of 2.04× and an average of 1.49× over the fastest GEMM-based convolution implementations on the GCU. Additionally, the ConCeal operators demonstrate competitive or superior computing-resource utilisation on certain ResNet and VGG convolution layers compared to cuDNN on an RTX 2080.
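The arithmetic saving the abstract refers to can be illustrated with the classic 1-D Winograd F(2, 3) algorithm, which underlies the 2-D 3 × 3 variant the paper targets. The sketch below is a generic textbook illustration in Python/NumPy, not the paper's ConCeal template or actual GCU code: it produces two convolution outputs from a four-sample tile with four multiplications instead of the six needed by direct computation.

```python
# Minimal sketch of 1-D Winograd F(2, 3) (textbook algorithm; not ConCeal).
# Four multiplications (m1..m4) yield two outputs of a 3-tap convolution,
# versus six multiplications for the direct sliding dot products.
import numpy as np

def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution over a 4-sample input tile."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

# Check against direct convolution on an arbitrary tile and kernel.
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
ref = np.array([d[i:i + 3] @ g for i in range(2)])  # direct: 6 multiplies
assert np.allclose(winograd_f23(d, g), ref)
```

Nesting this transform in both dimensions gives F(2 × 2, 3 × 3), which replaces 36 multiplications per output tile with 16; a template such as ConCeal then layers tiling, memory layout, and core-level parallelism on top of this arithmetic core.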
About the journal:
This international journal is directed to researchers, engineers, educators, managers, programmers, and users of computers who have particular interests in parallel processing and/or distributed computing.
The Journal of Parallel and Distributed Computing publishes original research papers and timely review articles on the theory, design, evaluation, and use of parallel and/or distributed computing systems. The journal also features special issues on these topics, again covering the full range from the design to the use of such systems.