{"title":"AIA:用于16nm离散采样工作负载的定制多核RISC-V SoC","authors":"Shirui Zhao;Nimish Shah;Wannes Meert;Marian Verhelst","doi":"10.1109/JSSC.2025.3561880","DOIUrl":null,"url":null,"abstract":"Probabilistic models (PMs) are essential in advancing machine learning capabilities, particularly in safety-critical applications involving reasoning and decision-making. Among the methods employed for inference in these models, sampling-based Markov chain Monte Carlo (MCMC) techniques are widely used. However, MCMC methods come with significant computational costs and are inherently challenging to parallelize, resulting in inefficient execution on conventional CPU/GPU platforms. To overcome these challenges, this article presents an approximate inference accelerator (AIA), a multi-core RISC-V system-on-chip (SoC) design fabricated using Intel’s 16 nm process technology. Our AIA is specifically designed to empower edge devices with robust decision-making and reasoning abilities. The AIA architecture incorporates an RISC-V host processor to manage chip-to-chip data communication and a 2-D mesh of 16 custom versatile RISC-V cores optimized for high-efficiency approximate inference. Each core features: 1) custom instructions and datapath blocks for non-normalized Knuth-Yao (KY) sampling, as well as for the interpolation of non-linear functions (e.g., logarithmic and exponential), and 2) direct data-access to the register file (RF) of each neighboring core, to reduce the data movement costs of frequent data exchanges between nearby cores. To further capitalize on the parallelism potential in MCMC algorithms, we developed a specialized compile chain that enables efficient spatial mapping and scheduling across the cores. As a result, AIA attains a peak sampling rate of 1277 MSamples/s at 0.9 V and achieves an energy efficiency of 20 GSamples/s/W at 0.7 V, surpassing the previous state-of-the-art (SotA) ASIC accelerator for probabilistic inference by up to <inline-formula> <tex-math>$6\\times $ </tex-math></inline-formula> in speed and <inline-formula> <tex-math>$5\\times $ </tex-math></inline-formula> in energy efficiency. Furthermore, the AIA’s versatility is demonstrated through the successful mapping of different types of PM workloads onto the chip.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":"60 7","pages":"2447-2460"},"PeriodicalIF":4.6000,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AIA: A Customized Multi-Core RISC-V SoC for Discrete Sampling Workloads in 16 nm\",\"authors\":\"Shirui Zhao;Nimish Shah;Wannes Meert;Marian Verhelst\",\"doi\":\"10.1109/JSSC.2025.3561880\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Probabilistic models (PMs) are essential in advancing machine learning capabilities, particularly in safety-critical applications involving reasoning and decision-making. Among the methods employed for inference in these models, sampling-based Markov chain Monte Carlo (MCMC) techniques are widely used. However, MCMC methods come with significant computational costs and are inherently challenging to parallelize, resulting in inefficient execution on conventional CPU/GPU platforms. To overcome these challenges, this article presents an approximate inference accelerator (AIA), a multi-core RISC-V system-on-chip (SoC) design fabricated using Intel’s 16 nm process technology. 
Our AIA is specifically designed to empower edge devices with robust decision-making and reasoning abilities. The AIA architecture incorporates an RISC-V host processor to manage chip-to-chip data communication and a 2-D mesh of 16 custom versatile RISC-V cores optimized for high-efficiency approximate inference. Each core features: 1) custom instructions and datapath blocks for non-normalized Knuth-Yao (KY) sampling, as well as for the interpolation of non-linear functions (e.g., logarithmic and exponential), and 2) direct data-access to the register file (RF) of each neighboring core, to reduce the data movement costs of frequent data exchanges between nearby cores. To further capitalize on the parallelism potential in MCMC algorithms, we developed a specialized compile chain that enables efficient spatial mapping and scheduling across the cores. As a result, AIA attains a peak sampling rate of 1277 MSamples/s at 0.9 V and achieves an energy efficiency of 20 GSamples/s/W at 0.7 V, surpassing the previous state-of-the-art (SotA) ASIC accelerator for probabilistic inference by up to <inline-formula> <tex-math>$6\\\\times $ </tex-math></inline-formula> in speed and <inline-formula> <tex-math>$5\\\\times $ </tex-math></inline-formula> in energy efficiency. Furthermore, the AIA’s versatility is demonstrated through the successful mapping of different types of PM workloads onto the chip.\",\"PeriodicalId\":13129,\"journal\":{\"name\":\"IEEE Journal of Solid-state Circuits\",\"volume\":\"60 7\",\"pages\":\"2447-2460\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-04-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal of Solid-state Circuits\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10980265/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10980265/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Probabilistic models (PMs) are essential in advancing machine learning capabilities, particularly in safety-critical applications involving reasoning and decision-making. Among the methods employed for inference in these models, sampling-based Markov chain Monte Carlo (MCMC) techniques are widely used. However, MCMC methods come with significant computational costs and are inherently challenging to parallelize, resulting in inefficient execution on conventional CPU/GPU platforms. To overcome these challenges, this article presents an approximate inference accelerator (AIA), a multi-core RISC-V system-on-chip (SoC) design fabricated using Intel’s 16 nm process technology. Our AIA is specifically designed to empower edge devices with robust decision-making and reasoning abilities. The AIA architecture incorporates a RISC-V host processor to manage chip-to-chip data communication and a 2-D mesh of 16 custom versatile RISC-V cores optimized for high-efficiency approximate inference. Each core features: 1) custom instructions and datapath blocks for non-normalized Knuth-Yao (KY) sampling, as well as for the interpolation of non-linear functions (e.g., logarithmic and exponential), and 2) direct data access to the register file (RF) of each neighboring core, to reduce the data movement costs of frequent data exchanges between nearby cores. To further capitalize on the parallelism potential in MCMC algorithms, we developed a specialized compilation chain that enables efficient spatial mapping and scheduling across the cores. As a result, AIA attains a peak sampling rate of 1277 MSamples/s at 0.9 V and achieves an energy efficiency of 20 GSamples/s/W at 0.7 V, surpassing the previous state-of-the-art (SotA) ASIC accelerator for probabilistic inference by up to $6\times $ in speed and $5\times $ in energy efficiency. Furthermore, the AIA’s versatility is demonstrated through the successful mapping of different types of PM workloads onto the chip.
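For readers unfamiliar with the sampling primitive that the custom instructions accelerate, the following is a minimal software sketch of standard Knuth-Yao DDG-tree sampling in Python. It is illustrative only: it assumes normalized floating-point probabilities and does not reproduce the paper's non-normalized, fixed-point hardware variant; the function name, `precision` parameter, and injectable bit source are hypothetical choices made for this sketch.

```python
import random
from collections import Counter

def knuth_yao_sample(probs, precision=32, rand_bit=lambda: random.getrandbits(1)):
    """Draw one index from a discrete distribution via a Knuth-Yao
    DDG-tree walk, consuming one uniform random bit per tree level."""
    n = len(probs)
    # Probability matrix: P[row][col] is the col-th binary digit of probs[row].
    P = [[(int(p * (1 << precision)) >> (precision - 1 - c)) & 1
          for c in range(precision)]
         for p in probs]
    d = 0                                  # distance counter of the tree walk
    for col in range(precision):
        d = 2 * d + rand_bit()             # descend one level of the DDG tree
        for row in range(n - 1, -1, -1):   # visit this level's terminal nodes
            d -= P[row][col]
            if d < 0:                      # landed on a terminal node
                return row
    return n - 1                           # residual mass below 2**-precision

if __name__ == "__main__":
    # Empirical check: frequencies should approach the target distribution.
    target = [0.5, 0.3, 0.2]
    counts = Counter(knuth_yao_sample(target) for _ in range(100_000))
    print({k: v / 100_000 for k, v in sorted(counts.items())})
```

Knuth-Yao sampling is attractive for hardware because its expected random-bit consumption stays within roughly two bits of the distribution's entropy, so a dedicated datapath for the tree walk translates directly into high sample throughput.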
Journal introduction:
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits with particular emphasis on transistor-level design of integrated circuits. It also provides coverage of topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.