{"title":"RISC-V 内核上的 MiniFloats:使用混合精度短点积的 ISA 扩展","authors":"Luca Bertaccini;Gianna Paulin;Matheus Cavalcante;Tim Fischer;Stefan Mach;Luca Benini","doi":"10.1109/TETC.2024.3365354","DOIUrl":null,"url":null,"abstract":"Low-precision floating-point (FP) formats have recently been intensely investigated in the context of machine learning inference and training applications. While 16-bit formats are already widely used, 8-bit FP data types have lately emerged as a viable option for neural network training when employed in a mixed-precision scenario and combined with rounding methods increasing the precision in compound additions, such as stochastic rounding. So far, hardware implementations supporting FP8 are mostly implemented within domain-specific accelerators. We propose two RISC-V instruction set architecture (ISA) extensions, enhancing respectively scalar and vector general-purpose cores with low and mixed-precision capabilities. The extensions support two 8-bit and two 16-bit FP formats and are based on dot-product instructions accumulating at higher precision. We develop a hardware unit supporting mixed-precision dot products and stochastic rounding and integrate it into an open-source floating-point unit (FPU). Finally, we integrate the enhanced FPU into a cluster of scalar cores, as well as a cluster of vector cores, and implement them in a 12 nm FinFET technology. The former achieves 575 GFLOPS/W on FP8-to-FP16 matrix multiplications at 0.8 V, 1.26 GHz; the latter reaches 860 GFLOPS/W at 0.8 V, 1.08 GHz, 1.93x higher efficiency than computing on FP16-to-FP32.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"12 4","pages":"1040-1055"},"PeriodicalIF":5.1000,"publicationDate":"2024-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MiniFloats on RISC-V Cores: ISA Extensions With Mixed-Precision Short Dot Products\",\"authors\":\"Luca Bertaccini;Gianna Paulin;Matheus Cavalcante;Tim Fischer;Stefan Mach;Luca Benini\",\"doi\":\"10.1109/TETC.2024.3365354\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Low-precision floating-point (FP) formats have recently been intensely investigated in the context of machine learning inference and training applications. While 16-bit formats are already widely used, 8-bit FP data types have lately emerged as a viable option for neural network training when employed in a mixed-precision scenario and combined with rounding methods increasing the precision in compound additions, such as stochastic rounding. So far, hardware implementations supporting FP8 are mostly implemented within domain-specific accelerators. We propose two RISC-V instruction set architecture (ISA) extensions, enhancing respectively scalar and vector general-purpose cores with low and mixed-precision capabilities. The extensions support two 8-bit and two 16-bit FP formats and are based on dot-product instructions accumulating at higher precision. We develop a hardware unit supporting mixed-precision dot products and stochastic rounding and integrate it into an open-source floating-point unit (FPU). Finally, we integrate the enhanced FPU into a cluster of scalar cores, as well as a cluster of vector cores, and implement them in a 12 nm FinFET technology. 
The former achieves 575 GFLOPS/W on FP8-to-FP16 matrix multiplications at 0.8 V, 1.26 GHz; the latter reaches 860 GFLOPS/W at 0.8 V, 1.08 GHz, 1.93x higher efficiency than computing on FP16-to-FP32.\",\"PeriodicalId\":13156,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computing\",\"volume\":\"12 4\",\"pages\":\"1040-1055\"},\"PeriodicalIF\":5.1000,\"publicationDate\":\"2024-02-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10440050/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10440050/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Low-precision floating-point (FP) formats have recently been intensely investigated in the context of machine learning inference and training applications. While 16-bit formats are already widely used, 8-bit FP data types have lately emerged as a viable option for neural network training when employed in a mixed-precision scenario and combined with rounding methods, such as stochastic rounding, that increase the precision of compound additions. So far, hardware support for FP8 has mostly been confined to domain-specific accelerators. We propose two RISC-V instruction set architecture (ISA) extensions that enhance scalar and vector general-purpose cores, respectively, with low- and mixed-precision capabilities. The extensions support two 8-bit and two 16-bit FP formats and are built around dot-product instructions that accumulate at higher precision. We develop a hardware unit supporting mixed-precision dot products and stochastic rounding and integrate it into an open-source floating-point unit (FPU). Finally, we integrate the enhanced FPU into a cluster of scalar cores, as well as a cluster of vector cores, and implement both in a 12 nm FinFET technology. The former achieves 575 GFLOPS/W on FP8-to-FP16 matrix multiplications at 0.8 V and 1.26 GHz; the latter reaches 860 GFLOPS/W at 0.8 V and 1.08 GHz, a 1.93× higher energy efficiency than computing on FP16-to-FP32.
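To make the two mechanisms named in the abstract concrete (a short dot product whose FP8 inputs are accumulated at FP16 precision, and stochastic rounding of the result), here is a minimal Python sketch of the numerics only. The format parameters (an E5M2-style FP8 and an IEEE-like FP16), the helper names quantize and sdotp_fp8_to_fp16, and the simplified treatment of subnormals and overflow are illustrative assumptions, not the paper's ISA encoding or hardware implementation.

```python
import math
import random

def quantize(x: float, exp_bits: int, man_bits: int, stochastic: bool = False) -> float:
    """Round x to a toy minifloat with the given exponent/mantissa widths.

    Simplifications: the exponent is clamped to the normal range (which also
    reproduces subnormal spacing near zero), overflow is clamped to the largest
    normal value, and NaN/Inf are not modeled.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    bias = (1 << (exp_bits - 1)) - 1
    e = math.floor(math.log2(mag))            # unbiased exponent of x
    e = max(min(e, bias), 1 - bias)           # clamp to the normal range
    step = 2.0 ** (e - man_bits)              # spacing of representable values
    q = mag / step                            # magnitude in units of 'step'
    if stochastic:
        frac = q - math.floor(q)              # round up with probability 'frac'
        q = math.floor(q) + (1 if random.random() < frac else 0)
    else:
        q = round(q)                          # round to nearest
    max_normal = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    return sign * min(q * step, max_normal)

fp8  = lambda x: quantize(x, exp_bits=5, man_bits=2)                 # E5M2-style input format
fp16 = lambda x, sr=False: quantize(x, exp_bits=5, man_bits=10, stochastic=sr)

def sdotp_fp8_to_fp16(a, b, acc, stochastic=True):
    """Two-way FP8 dot product accumulated in FP16: acc + a0*b0 + a1*b1.

    The FP8 operands carry 3-bit significands, so each product has at most
    6 significant bits and fits exactly in the wider accumulator (barring
    overflow); rounding is deferred to the final sum, mirroring a dot-product
    unit with a wider internal datapath.
    """
    exact = acc + a[0] * b[0] + a[1] * b[1]   # wide (here: double-precision) accumulation
    return fp16(exact, sr=stochastic)

a = [fp8(0.31), fp8(-1.7)]
b = [fp8(0.92), fp8(0.05)]
print(sdotp_fp8_to_fp16(a, b, acc=fp16(2.0)))
```

Because rounding happens only once per short dot product, and stochastic rounding makes that rounding unbiased on average, long chains of such accumulations do not systematically lose small updates; this is the property the abstract associates with making FP8 viable for neural network training.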
Journal introduction:
IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Some examples of emerging topics in computing include: IT for Green, synthetic and organic computing structures and systems, advanced analytics, social/occupational computing, location-based/client computer systems, morphic computer design, electronic game systems, and health-care IT.