NLU: An Adaptive, Small-Footprint, Low-Power Neural Learning Unit for Edge and IoT Applications
{"title":"NLU: An Adaptive, Small-Footprint, Low-Power Neural Learning Unit for Edge and IoT Applications","authors":"Amirhossein Rostami;Seyed Mohammad Ali Zeinolabedin;Liyuan Guo;Florian Kelber;Heiner Bauer;Andreas Dixius;Stefan Scholze;Marc Berthel;Dennis Walter;Johannes Uhlig;Bernhard Vogginger;Christian Mayr","doi":"10.1109/OJCAS.2025.3546067","DOIUrl":null,"url":null,"abstract":"Over the last few years, online training of deep neural networks (DNNs) on edge and mobile devices has attracted increasing interest in practical use cases due to their adaptability to new environments, personalization, and privacy preservation. Despite these advantages, online learning on resource-restricted devices is challenging. This work demonstrates a 16-bit floating-point, flexible, power- and memory-efficient neural learning unit (NLU) that can be integrated into processors to accelerate the learning process. To achieve this, we implemented three key strategies: a dynamic control unit, a tile allocation engine, and a neural compute pipeline, which together enhance data reuse and improve the flexibility of the NLU. The NLU was integrated into a system-on-chip (SoC) featuring a 32-bit RISC-V core and memory subsystems, fabricated using GlobalFoundries 22nm FDSOI technology. The design occupies just <inline-formula> <tex-math>$0.015mm^{2}$ </tex-math></inline-formula> of silicon area and consumes only 0.379 mW of power. The results show that the NLU can accelerate the training process by up to <inline-formula> <tex-math>$24.38\\times $ </tex-math></inline-formula> and reduce energy consumption by up to <inline-formula> <tex-math>$37.37\\times $ </tex-math></inline-formula> compared to a RISC-V implementation with a floating-point unit (FPU). Additionally, compared to the state-of-the-art RISC-V with vector coprocessor, the NLU achieves <inline-formula> <tex-math>$4.2\\times $ </tex-math></inline-formula> higher energy efficiency (measured in GFLOPS/W). These results demonstrate the feasibility of our design for edge and IoT devices, positioning it favorably among state-of-the-art on-chip learning solutions. Furthermore, we performed mixed-precision on-chip training from scratch for keyword spotting tasks using the Google Speech Commands (GSC) dataset. Training on just 40% of the dataset, the NLU achieved a training accuracy of 89.34% with stochastic rounding.","PeriodicalId":93442,"journal":{"name":"IEEE open journal of circuits and systems","volume":"6 ","pages":"85-99"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10904478","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of circuits and systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10904478/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Amirhossein Rostami, Seyed Mohammad Ali Zeinolabedin, Liyuan Guo, Florian Kelber, Heiner Bauer, Andreas Dixius, Stefan Scholze, Marc Berthel, Dennis Walter, Johannes Uhlig, Bernhard Vogginger, Christian Mayr
Over the last few years, online training of deep neural networks (DNNs) on edge and mobile devices has attracted increasing interest in practical use cases, owing to its adaptability to new environments, support for personalization, and preservation of privacy. Despite these advantages, online learning on resource-restricted devices is challenging. This work demonstrates a flexible, power- and memory-efficient, 16-bit floating-point neural learning unit (NLU) that can be integrated into processors to accelerate the learning process. To achieve this, we implemented three key strategies: a dynamic control unit, a tile allocation engine, and a neural compute pipeline, which together enhance data reuse and improve the NLU's flexibility. The NLU was integrated into a system-on-chip (SoC) featuring a 32-bit RISC-V core and memory subsystems, fabricated in GlobalFoundries 22-nm FDSOI technology. The design occupies just $0.015\,\mathrm{mm}^{2}$ of silicon area and consumes only 0.379 mW of power. The results show that the NLU can accelerate the training process by up to $24.38\times$ and reduce energy consumption by up to $37.37\times$ compared to a RISC-V implementation with a floating-point unit (FPU). Additionally, compared to a state-of-the-art RISC-V with a vector coprocessor, the NLU achieves $4.2\times$ higher energy efficiency (measured in GFLOPS/W). These results demonstrate the feasibility of our design for edge and IoT devices, positioning it favorably among state-of-the-art on-chip learning solutions. Furthermore, we performed mixed-precision on-chip training from scratch for keyword-spotting tasks on the Google Speech Commands (GSC) dataset. Training on just 40% of the dataset, the NLU achieved 89.34% training accuracy with stochastic rounding.
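
The abstract attributes much of the efficiency gain to the tile allocation engine and improved data reuse. The microarchitectural details are in the paper itself, but the underlying idea can be sketched generically: a blocked (tiled) matrix multiply keeps small operand tiles resident while they are reused across many partial products. The sketch below is an illustration of that general technique, not the NLU's actual datapath; the tile size and the NumPy formulation are assumptions made for exposition.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 8) -> np.ndarray:
    """Blocked (tiled) matrix multiply.

    Each (tile x tile) sub-block of A and B is reused across a whole
    block of partial products of C -- the data-reuse pattern that a
    tile-allocation scheme exploits to cut external memory traffic.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # in hardware, these slices would live in local buffers
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c
```

The blocking changes only the order of memory accesses, not the result: `np.allclose(tiled_matmul(a, b), a @ b)` holds for any conforming shapes.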
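
One back-of-the-envelope check can be read off the abstract's own figures (a derivation, not a claim from the paper's body, and valid only if the peak speedup and peak energy reduction occur on the same workload): since energy is power integrated over time, $E_{\text{FPU}}/E_{\text{NLU}} = (P_{\text{FPU}}/P_{\text{NLU}}) \cdot (t_{\text{FPU}}/t_{\text{NLU}})$. The implied average power ratio is therefore $37.37 / 24.38 \approx 1.53$, i.e., the FPU baseline draws roughly $1.53\times$ more average power, with the bulk of the energy saving coming from the $24.38\times$ shorter runtime.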
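
The final result depends on stochastic rounding during mixed-precision training: rounding up or down at random, with probability proportional to the discarded fraction, makes the rounding error zero-mean, which is what lets low-precision training from scratch converge. A minimal sketch of the technique follows, using the standard bit-manipulation trick for rounding float32 to bfloat16 precision. This is an illustrative stand-in, not the paper's hardware: the NLU datapath is 16-bit floating point, and the abstract does not specify its rounding circuitry.

```python
import numpy as np

def stochastic_round_bf16(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round float32 values to bfloat16 precision stochastically.

    bfloat16 keeps the top 16 bits of the float32 bit pattern. Adding
    uniform random noise to the 16 bits about to be dropped, then
    truncating, rounds each value up with probability proportional to
    the discarded fraction, so the rounding error is zero in
    expectation. (Finite, non-NaN inputs assumed in this sketch.)
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    truncated = (bits + noise) & np.uint32(0xFFFF0000)
    return truncated.view(np.float32)

# Usage: repeated stochastic rounding averages back to the full-precision value.
rng = np.random.default_rng(0)
w = np.float32(1.0) + np.float32(1e-3) * rng.standard_normal(5).astype(np.float32)
print(stochastic_round_bf16(w, rng))
```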