An On-Chip-Training Keyword-Spotting Chip Using Interleaved Pipeline and Computation-in-Memory Cluster in 28-nm CMOS

IF 2.8 · Zone 2 (Engineering & Technology) · JCR Q2 · COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems · Pub Date: 2025-01-10 · DOI: 10.1109/TVLSI.2025.3525740

Junyi Qian;Cai Li;Long Chen;Ruidong Li;Tuo Li;Peng Cao;Xin Si;Weiwei Shan

Volume 33, Issue 5, pp. 1497-1501 · Full text: https://ieeexplore.ieee.org/document/10838343/

Citations: 0

Abstract


To improve the precision of keyword spotting (KWS) for individual users on edge devices, we propose an on-chip-training KWS (OCT-KWS) chip that protects private data while also achieving ultralow-power inference. Our main contributions are: 1) identity interchange and interleaved pipeline methods during backpropagation (BP), enabling the pipelined execution of operations that traditionally had to be performed sequentially and reducing cache requirements for loss values by 95.8%; 2) an all-digital isolated-bitline (BL)-based computation-in-memory (CIM) macro, which eliminates ineffective computations caused by glitches and achieves $2.03\times$ higher energy efficiency; and 3) a multisize CIM cluster-based BP data flow, in which the CIM macros are designed collaboratively to achieve all-time full utilization, reducing output feature map (Ofmap) accesses by 47.2%. Fabricated in 28-nm CMOS and enhanced with a refined library characterization methodology, this chip achieves both the highest training energy efficiency (101.5 TOPS/W) and the lowest inference energy (9.9 nJ/decision) among current KWS chips. By retraining a three-class depthwise-separable convolutional neural network (DSCNN), detection accuracy on the private dataset increases from 80.8% to 98.9%.
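The headline figures in the abstract can be restated in more intuitive units. The short Python sketch below is illustrative arithmetic only, not anything taken from the paper: the helper names are hypothetical, the energy-per-operation conversion assumes the usual reading of 1 TOPS/W as 10^12 operations per joule (how an "operation" is counted is not stated in the abstract), and the energy-saving figure assumes the 2.03x efficiency gain applies to an identical workload.

```python
# Illustrative arithmetic on the figures quoted in the abstract.
# All function names are hypothetical helpers, not part of the paper.

def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of a quantity left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0

def energy_per_op_fj(tops_per_watt: float) -> float:
    """Energy per operation in femtojoules.

    Assumes 1 TOPS/W = 1e12 operations per joule, so one operation
    costs 1 / (TOPS/W) picojoules, i.e. 1000 / (TOPS/W) femtojoules.
    """
    return 1000.0 / tops_per_watt

def energy_saving_pct(efficiency_gain: float) -> float:
    """Percent energy saved for the same work given an efficiency gain factor."""
    return (1.0 - 1.0 / efficiency_gain) * 100.0

def error_reduction_factor(acc_before_pct: float, acc_after_pct: float) -> float:
    """Ratio of the error rate before retraining to the error rate after."""
    return (100.0 - acc_before_pct) / (100.0 - acc_after_pct)

if __name__ == "__main__":
    print(f"Loss-value cache remaining: {remaining_fraction(95.8):.1%}")
    print(f"Ofmap accesses remaining:   {remaining_fraction(47.2):.1%}")
    print(f"Training energy per op:     {energy_per_op_fj(101.5):.2f} fJ")
    print(f"CIM macro energy saved:     {energy_saving_pct(2.03):.1f}%")
    print(f"Error-rate reduction:       {error_reduction_factor(80.8, 98.9):.1f}x")
```

Under these assumptions, 101.5 TOPS/W corresponds to roughly 9.85 fJ per operation, a 2.03x efficiency gain to about half the energy for the same work, and the accuracy improvement from 80.8% to 98.9% to an error rate falling from 19.2% to 1.1%.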


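Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months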

About the journal: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.