COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning

Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu
{"title":"com - mcm:面向可扩展多芯片模块边缘机器学习的双极位稀疏优化的内存边界计算神经网络处理器","authors":"Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu","doi":"10.1109/ISSCC42614.2022.9731657","DOIUrl":null,"url":null,"abstract":"Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of Al tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]–[3]. These CIM macros still access weight data through on/off-chip memories, similar to processing elements in near-memory-computing architectures. The implementation poses challenges when counting the overall SoC energy efficiency (Fig. 15.3.1). First, the memory wall issue is unsolved. The weight updates affect overall system performance when large networks are deployed and massive off-chip weight data transfer occurs. Even for tiny machine learning tasks, power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured and coarse-grained sparsity optimization is highly algorithm dependent and requires explicit zero-detection blocks. Power optimization schemes for fine-grained or even arbitrary-sparsity patterns are lacking. Third, edge machine learning chips are cost sensitive. The conventional monolithic SoC design strategy, fabricating one specific SoC for each application, is not affordable in terms of NRE costs.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"31 1","pages":"1-3"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning\",\"authors\":\"Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu\",\"doi\":\"10.1109/ISSCC42614.2022.9731657\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of Al tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]–[3]. These CIM macros still access weight data through on/off-chip memories, similar to processing elements in near-memory-computing architectures. The implementation poses challenges when counting the overall SoC energy efficiency (Fig. 15.3.1). First, the memory wall issue is unsolved. The weight updates affect overall system performance when large networks are deployed and massive off-chip weight data transfer occurs. Even for tiny machine learning tasks, power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured and coarse-grained sparsity optimization is highly algorithm dependent and requires explicit zero-detection blocks. 
Power optimization schemes for fine-grained or even arbitrary-sparsity patterns are lacking. Third, edge machine learning chips are cost sensitive. The conventional monolithic SoC design strategy, fabricating one specific SoC for each application, is not affordable in terms of NRE costs.\",\"PeriodicalId\":6830,\"journal\":{\"name\":\"2022 IEEE International Solid- State Circuits Conference (ISSCC)\",\"volume\":\"31 1\",\"pages\":\"1-3\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Solid- State Circuits Conference (ISSCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISSCC42614.2022.9731657\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC42614.2022.9731657","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of AI tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]-[3]. These CIM macros still access weight data through on- and off-chip memories, much like the processing elements in near-memory-computing architectures. This implementation poses challenges when accounting for overall SoC energy efficiency (Fig. 15.3.1). First, the memory-wall issue remains unsolved: weight updates degrade overall system performance when large networks are deployed and massive off-chip weight transfers occur. Even for tiny machine learning tasks, the power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is already optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured, coarse-grained sparsity optimization is highly algorithm-dependent and requires explicit zero-detection blocks; power-optimization schemes for fine-grained or even arbitrary sparsity patterns are lacking. Third, edge machine learning chips are cost-sensitive, and the conventional monolithic SoC design strategy of fabricating one specific SoC for each application is not affordable in terms of NRE costs.
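The memory-wall point is, at bottom, simple arithmetic: when a MAC costs only about twice an on-chip access per bit, any off-chip weight traffic quickly dominates the energy budget. Below is a minimal back-of-envelope sketch using the abstract's 2pJ/b (MAC) and 1pJ/b (on-chip access) figures; the off-chip cost and the `memory_energy_share` helper are illustrative assumptions, not numbers from the paper.

```python
# Energy split for one weight bit under a given weight-reuse factor.
# MAC and on-chip figures come from the abstract; the off-chip figure
# is an assumption for illustration only.
E_MAC_PJ_PER_BIT  = 2.0    # MAC compute (from the abstract)
E_SRAM_PJ_PER_BIT = 1.0    # on-chip memory access (from the abstract)
E_DRAM_PJ_PER_BIT = 20.0   # assumed off-chip access cost (not from the paper)

def memory_energy_share(reuse: int) -> float:
    """Fraction of energy spent moving (rather than computing on) a weight
    bit, when one off-chip fetch is amortized over `reuse` MAC operations."""
    compute  = E_MAC_PJ_PER_BIT * reuse
    movement = E_SRAM_PJ_PER_BIT * reuse + E_DRAM_PJ_PER_BIT
    return movement / (compute + movement)

for reuse in (1, 8, 64):
    print(f"weight reuse {reuse:3d}x: {memory_energy_share(reuse):.0%} of energy is data movement")
```

Even at 64x reuse, data movement still accounts for close to 40% of the budget under these assumptions, which is why the weight-movement path, not the MAC itself, is the target of optimization.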
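The bipolar bitwise sparsity scheme itself is not described in the abstract, so the following is only a sketch of the general bit-level idea the title alludes to: in a bit-serial MAC, every zero weight bit can skip its add, and if the hardware can skip whichever bit polarity dominates each bit-plane (e.g., by inverting the plane and compensating), even dense weights with arbitrary sparsity patterns expose skippable work, without structured pruning or explicit zero-value detection. The encoding and function below are our illustrative assumptions, not the paper's implementation.

```python
# Illustrative only: estimates how many bit-serial add operations could
# be skipped if, per weight bit-plane, the majority bit polarity is
# skipped (our "bipolar" reading of the title, not the paper's code).
import numpy as np

def skippable_fraction(weights: np.ndarray, bits: int = 8) -> float:
    """Fraction of bit-plane operations skippable across `bits` planes."""
    w = weights.astype(np.int64) & ((1 << bits) - 1)  # two's-complement view
    skipped = 0
    for b in range(bits):
        plane = (w >> b) & 1                      # one bit-plane of the tensor
        ones = int(plane.sum())
        skipped += max(ones, plane.size - ones)   # skip the majority polarity
    return skipped / (bits * w.size)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 16.0, size=4096)              # toy 8b weight tensor
print(f"skippable bit-ops: {skippable_fraction(w):.1%}")
```

By construction at least half of the bit-level work is skippable regardless of how the nonzero values are distributed, which is the appeal over structured sparsity: no particular pruning pattern has to be imposed on the algorithm.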