S3A-NPU:一种用于自监督学习和动态自适应记忆优化的高性能硬件加速器

IF 2.8 2区 工程技术 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Heuijee Yun;Daejin Park
{"title":"S3A-NPU:一种用于自监督学习和动态自适应记忆优化的高性能硬件加速器","authors":"Heuijee Yun;Daejin Park","doi":"10.1109/TVLSI.2025.3566949","DOIUrl":null,"url":null,"abstract":"Spiking self-supervised learning (SSL) has become prevalent for low power consumption and low-latency properties, as well as the ability to learn from large quantities of unlabeled data. However, the computational intensity and resource requirements are significant challenges to apply to accelerators. In this article, we propose the scalable, spiking self-supervised learning, streamline optimization accelerator (<inline-formula> <tex-math>$S^{3}$ </tex-math></inline-formula>A)-neural processing unit (NPU), a highly optimized accelerator for spiking SSL models. This architecture minimizes memory access by leveraging input data provided by the user and optimizes computation through the maximization of data reuse. By dynamically optimizing memory based on model characteristics and implementing specialized operations for data preprocessing, which are critical in SSL, computational efficiency can be significantly improved. The parallel processing lanes account for the two encoders in the SSL architecture, combined with a pipelined structure that considers the temporal data accumulation of spiking neural networks (SNNs) to enhance computational efficiency. We evaluate the design on field-programmable gate array (FPGA), where a 16-bit quantized spiking residual network (ResNet) model trained on the Canadian Institute for Advanced Research (CIFAR) and MNIST dataset has top 94.08% accuracy. <inline-formula> <tex-math>$S^{3}$ </tex-math></inline-formula>A-NPU optimization significantly improved computational resource utilization, resulting in a 25% reduction in latency. Moreover, as the first spiking self-supervised accelerator, it demonstrated highly efficient computation compared to existing accelerators, utilizing only 29k look up tables (LUTs) and eight block random access memories (BRAMs). This makes it highly suitable for resource-constrained applications, particularly in the context of spiking SSL models on edge devices. We implemented it on a silicon chip using a 130-nm process design kit (PDK), and the design was less than <inline-formula> <tex-math>$1~\\text {cm}^{2}$ </tex-math></inline-formula>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 7","pages":"1886-1898"},"PeriodicalIF":2.8000,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11010182","citationCount":"0","resultStr":"{\"title\":\"S3A-NPU: A High-Performance Hardware Accelerator for Spiking Self-Supervised Learning With Dynamic Adaptive Memory Optimization\",\"authors\":\"Heuijee Yun;Daejin Park\",\"doi\":\"10.1109/TVLSI.2025.3566949\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Spiking self-supervised learning (SSL) has become prevalent for low power consumption and low-latency properties, as well as the ability to learn from large quantities of unlabeled data. However, the computational intensity and resource requirements are significant challenges to apply to accelerators. In this article, we propose the scalable, spiking self-supervised learning, streamline optimization accelerator (<inline-formula> <tex-math>$S^{3}$ </tex-math></inline-formula>A)-neural processing unit (NPU), a highly optimized accelerator for spiking SSL models. This architecture minimizes memory access by leveraging input data provided by the user and optimizes computation through the maximization of data reuse. By dynamically optimizing memory based on model characteristics and implementing specialized operations for data preprocessing, which are critical in SSL, computational efficiency can be significantly improved. The parallel processing lanes account for the two encoders in the SSL architecture, combined with a pipelined structure that considers the temporal data accumulation of spiking neural networks (SNNs) to enhance computational efficiency. We evaluate the design on field-programmable gate array (FPGA), where a 16-bit quantized spiking residual network (ResNet) model trained on the Canadian Institute for Advanced Research (CIFAR) and MNIST dataset has top 94.08% accuracy. <inline-formula> <tex-math>$S^{3}$ </tex-math></inline-formula>A-NPU optimization significantly improved computational resource utilization, resulting in a 25% reduction in latency. Moreover, as the first spiking self-supervised accelerator, it demonstrated highly efficient computation compared to existing accelerators, utilizing only 29k look up tables (LUTs) and eight block random access memories (BRAMs). This makes it highly suitable for resource-constrained applications, particularly in the context of spiking SSL models on edge devices. We implemented it on a silicon chip using a 130-nm process design kit (PDK), and the design was less than <inline-formula> <tex-math>$1~\\\\text {cm}^{2}$ </tex-math></inline-formula>.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"33 7\",\"pages\":\"1886-1898\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11010182\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11010182/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11010182/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0

摘要

由于低功耗和低延迟特性,以及从大量未标记数据中学习的能力,尖峰自监督学习(SSL)已经变得非常流行。然而,计算强度和资源需求是应用于加速器的重大挑战。在本文中,我们提出了可扩展的、峰值自监督学习的流线优化加速器($S^{3}$ A)——神经处理单元(NPU),一个用于峰值SSL模型的高度优化的加速器。该体系结构通过利用用户提供的输入数据来最小化内存访问,并通过最大化数据重用来优化计算。通过基于模型特征动态优化内存和实现数据预处理的专用操作(这在SSL中是至关重要的),可以显著提高计算效率。在SSL架构中,并行处理通道占两个编码器,并结合考虑峰值神经网络(snn)的时间数据积累的流水线结构来提高计算效率。我们在现场可编程门阵列(FPGA)上对设计进行了评估,其中在加拿大高级研究所(CIFAR)和MNIST数据集上训练的16位量化尖峰残余网络(ResNet)模型具有高达94.08%的准确率。npu优化显著提高了计算资源利用率,导致延迟减少25%。此外,作为第一个峰值自监督加速器,与现有加速器相比,它展示了高效的计算,仅使用29k查找表(lut)和8块随机存取存储器(bram)。这使得它非常适合资源受限的应用程序,特别是在边缘设备上使用SSL模型的上下文中。我们使用130纳米制程设计套件(PDK)在硅芯片上实现了它,设计成本小于$1~\text {cm}^{2}$。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
S3A-NPU: A High-Performance Hardware Accelerator for Spiking Self-Supervised Learning With Dynamic Adaptive Memory Optimization
Spiking self-supervised learning (SSL) has become prevalent for low power consumption and low-latency properties, as well as the ability to learn from large quantities of unlabeled data. However, the computational intensity and resource requirements are significant challenges to apply to accelerators. In this article, we propose the scalable, spiking self-supervised learning, streamline optimization accelerator ( $S^{3}$ A)-neural processing unit (NPU), a highly optimized accelerator for spiking SSL models. This architecture minimizes memory access by leveraging input data provided by the user and optimizes computation through the maximization of data reuse. By dynamically optimizing memory based on model characteristics and implementing specialized operations for data preprocessing, which are critical in SSL, computational efficiency can be significantly improved. The parallel processing lanes account for the two encoders in the SSL architecture, combined with a pipelined structure that considers the temporal data accumulation of spiking neural networks (SNNs) to enhance computational efficiency. We evaluate the design on field-programmable gate array (FPGA), where a 16-bit quantized spiking residual network (ResNet) model trained on the Canadian Institute for Advanced Research (CIFAR) and MNIST dataset has top 94.08% accuracy. $S^{3}$ A-NPU optimization significantly improved computational resource utilization, resulting in a 25% reduction in latency. Moreover, as the first spiking self-supervised accelerator, it demonstrated highly efficient computation compared to existing accelerators, utilizing only 29k look up tables (LUTs) and eight block random access memories (BRAMs). This makes it highly suitable for resource-constrained applications, particularly in the context of spiking SSL models on edge devices. We implemented it on a silicon chip using a 130-nm process design kit (PDK), and the design was less than $1~\text {cm}^{2}$ .
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
6.40
自引率
7.10%
发文量
187
审稿时长
3.6 months
期刊介绍: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信