Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors

IF 3.7 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of Systems Architecture Pub Date : 2024-07-08 DOI:10.1016/j.sysarc.2024.103236

Nitish Satya Murthy, Francky Catthoor, Marian Verhelst


Citations: 0


Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors

A continuing rise in DNN usage in distributed and embedded use cases has demanded more efficient hardware execution in the field. Low-precision GeMMs with optimized data formats have played a key role in making networks more memory- and compute-efficient. Recently trending formats are block-scaled representations, which stem from tight HW-SW co-optimization and compress network size by sharing exponents per data block. Prior work mostly focuses on deploying such block-scaled GeMM operations on domain-specific accelerators, achieving optimum efficiency at the cost of flexibility and ease of deployment. In this work, we exploit and optimize the deployment of block-scaled GeMMs on fully-programmable in-order vector processors using ARM SVE. We define a systematic methodology for design space exploration that optimally matches the workload specifications with processor vector lengths, different microkernels, and block sizes and shapes. We introduce efficient intrinsics-based microkernels with effective loop unrolling, and data-transfer-efficient fused requantization strategies, to maximize kernel performance while also supporting several deployment configurations. We enable generalized block-scaled kernel deployments through tunable block sizes and shapes, which helps accommodate different accuracy-speed trade-off requirements. Using 2D activation blocks instead of conventional 1D blocks, the static and dynamic BS-INT8 configurations yielded average speedups of 3.8x and 2.9x over FP32 models, respectively, with no accuracy loss for CNN classification tasks on the CIFAR10/100 datasets.
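The idea of sharing one exponent per data block can be illustrated with a minimal sketch. This is not the authors' ARM SVE microkernel: it is plain Python, the function names `quantize_block`, `quantize_tile`, and `block_scaled_dot` are hypothetical, and it assumes a power-of-two shared scale with symmetric INT8 values and Python's default rounding. A 1D block covers a contiguous segment of a vector; a 2D block shares one exponent across a small tile of a matrix.

```python
import math


def quantize_block(values, bits=8):
    """Quantize one block to signed integers that share a single exponent.

    Assumption: the shared scale is a power of two, 2**exp.
    Returns (integers, exp) such that value ~= integer * 2**exp.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0] * len(values), 0
    # Smallest exponent for which the largest magnitude still fits in qmax.
    exp = math.ceil(math.log2(peak / qmax))
    scale = 2.0 ** exp
    # Clamp to guard against floating-point edge cases in the log2 above.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exp


def quantize_tile(matrix, row0, col0, rows, cols):
    """2D block: one exponent shared by a rows x cols tile (row-major order)."""
    flat = [matrix[r][c]
            for r in range(row0, row0 + rows)
            for c in range(col0, col0 + cols)]
    return quantize_block(flat)


def block_scaled_dot(a, b, block=4):
    """Dot product of two equal-length vectors using 1D blocks.

    Inside a block only integer multiply-accumulates are used; the two
    shared exponents are applied once per block when rescaling.
    """
    total = 0.0
    for start in range(0, len(a), block):
        qa, ea = quantize_block(a[start:start + block])
        qb, eb = quantize_block(b[start:start + block])
        acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulation
        total += acc * 2.0 ** (ea + eb)           # one rescale per block
    return total


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, -0.5, 1.0, -1.0]
print(quantize_block(a))            # integers plus the shared exponent
print(block_scaled_dot(a, b, 4))    # compare with the float dot product
```

Smaller blocks track local dynamic range more closely but store more exponents and rescale more often, which is one way to read the accuracy-speed trade-off that the tunable block sizes and shapes are meant to expose.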


Source journal

Journal of Systems Architecture (Engineering & Technology, Computer Science: Hardware)

CiteScore: 8.70
Self-citation rate: 15.60%
Articles published: 226
Review time: 46 days

About the journal: The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures, as well as additional subjects in the computer and system architecture area, fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software. Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central in this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.