COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning

Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu
{"title":"com - mcm:面向可扩展多芯片模块边缘机器学习的双极位稀疏优化的内存边界计算神经网络处理器","authors":"Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu","doi":"10.1109/ISSCC42614.2022.9731657","DOIUrl":null,"url":null,"abstract":"Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of Al tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]–[3]. These CIM macros still access weight data through on/off-chip memories, similar to processing elements in near-memory-computing architectures. The implementation poses challenges when counting the overall SoC energy efficiency (Fig. 15.3.1). First, the memory wall issue is unsolved. The weight updates affect overall system performance when large networks are deployed and massive off-chip weight data transfer occurs. Even for tiny machine learning tasks, power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured and coarse-grained sparsity optimization is highly algorithm dependent and requires explicit zero-detection blocks. Power optimization schemes for fine-grained or even arbitrary-sparsity patterns are lacking. Third, edge machine learning chips are cost sensitive. The conventional monolithic SoC design strategy, fabricating one specific SoC for each application, is not affordable in terms of NRE costs.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"31 1","pages":"1-3"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning\",\"authors\":\"Haozhe Zhu, Bo Jiao, Jinshan Zhang, Xinru Jia, Yunzhengmao Wang, Tianchan Guan, Shengcheng Wang, Dimin Niu, Hongzhong Zheng, Chixiao Chen, Mingyu Wang, Lihua Zhang, Xiaoyang Zeng, Qi Liu, Yu-Jin Xie, Meilin Liu\",\"doi\":\"10.1109/ISSCC42614.2022.9731657\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of Al tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]–[3]. These CIM macros still access weight data through on/off-chip memories, similar to processing elements in near-memory-computing architectures. The implementation poses challenges when counting the overall SoC energy efficiency (Fig. 15.3.1). First, the memory wall issue is unsolved. The weight updates affect overall system performance when large networks are deployed and massive off-chip weight data transfer occurs. Even for tiny machine learning tasks, power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured and coarse-grained sparsity optimization is highly algorithm dependent and requires explicit zero-detection blocks. 
Power optimization schemes for fine-grained or even arbitrary-sparsity patterns are lacking. Third, edge machine learning chips are cost sensitive. The conventional monolithic SoC design strategy, fabricating one specific SoC for each application, is not affordable in terms of NRE costs.\",\"PeriodicalId\":6830,\"journal\":{\"name\":\"2022 IEEE International Solid- State Circuits Conference (ISSCC)\",\"volume\":\"31 1\",\"pages\":\"1-3\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Solid- State Circuits Conference (ISSCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISSCC42614.2022.9731657\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC42614.2022.9731657","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Recently, computing-in-memory (CIM) macros, originally designed to reduce the intensive memory accesses of AI tasks, have been employed in low-power machine learning SoCs due to their ultra-high computing efficiency [1]-[3]. These CIM macros still access weight data through on- and off-chip memories, much like the processing elements in near-memory-computing architectures. This implementation poses challenges when accounting for overall SoC energy efficiency (Fig. 15.3.1). First, the memory-wall issue remains unsolved: weight updates degrade overall system performance when large networks are deployed and massive off-chip weight transfers occur. Even for tiny machine learning tasks, the power consumption and latency of constant weight updates cannot be neglected, because MAC computing efficiency is already optimized and closely matches the efficiency of on-chip memory access (2pJ/b vs. 1pJ/b). Second, the viability of structured, coarse-grained sparsity optimization is highly algorithm-dependent and requires explicit zero-detection blocks; power-optimization schemes for fine-grained or even arbitrary sparsity patterns are lacking. Third, edge machine learning chips are cost-sensitive, and the conventional monolithic SoC design strategy of fabricating one specific SoC for each application is not affordable in terms of NRE costs.
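The memory-wall point is, at bottom, simple arithmetic: when a MAC costs only about twice an on-chip access per bit, any off-chip weight traffic quickly dominates the energy budget. Below is a minimal back-of-envelope sketch using the abstract's 2pJ/b (MAC) and 1pJ/b (on-chip access) figures; the off-chip cost and the `memory_energy_share` helper are illustrative assumptions, not numbers from the paper.

```python
# Energy split for one weight bit under a given weight-reuse factor.
# MAC and on-chip figures come from the abstract; the off-chip figure
# is an assumption for illustration only.
E_MAC_PJ_PER_BIT  = 2.0    # MAC compute (from the abstract)
E_SRAM_PJ_PER_BIT = 1.0    # on-chip memory access (from the abstract)
E_DRAM_PJ_PER_BIT = 20.0   # assumed off-chip access cost (not from the paper)

def memory_energy_share(reuse: int) -> float:
    """Fraction of energy spent moving (rather than computing on) a weight
    bit, when one off-chip fetch is amortized over `reuse` MAC operations."""
    compute  = E_MAC_PJ_PER_BIT * reuse
    movement = E_SRAM_PJ_PER_BIT * reuse + E_DRAM_PJ_PER_BIT
    return movement / (compute + movement)

for reuse in (1, 8, 64):
    print(f"weight reuse {reuse:3d}x: {memory_energy_share(reuse):.0%} of energy is data movement")
```

Even at 64x reuse, data movement still accounts for close to 40% of the budget under these assumptions, which is why the weight-movement path, not the MAC itself, is the target of optimization.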
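The bipolar bitwise sparsity scheme itself is not described in the abstract, so the following is only a sketch of the general bit-level idea the title alludes to: in a bit-serial MAC, every zero weight bit can skip its add, and if the hardware can skip whichever bit polarity dominates each bit-plane (e.g., by inverting the plane and compensating), even dense weights with arbitrary sparsity patterns expose skippable work, without structured pruning or explicit zero-value detection. The encoding and function below are our illustrative assumptions, not the paper's implementation.

```python
# Illustrative only: estimates how many bit-serial add operations could
# be skipped if, per weight bit-plane, the majority bit polarity is
# skipped (our "bipolar" reading of the title, not the paper's code).
import numpy as np

def skippable_fraction(weights: np.ndarray, bits: int = 8) -> float:
    """Fraction of bit-plane operations skippable across `bits` planes."""
    w = weights.astype(np.int64) & ((1 << bits) - 1)  # two's-complement view
    skipped = 0
    for b in range(bits):
        plane = (w >> b) & 1                      # one bit-plane of the tensor
        ones = int(plane.sum())
        skipped += max(ones, plane.size - ones)   # skip the majority polarity
    return skipped / (bits * w.size)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 16.0, size=4096)              # toy 8b weight tensor
print(f"skippable bit-ops: {skippable_fraction(w):.1%}")
```

By construction at least half of the bit-level work is skippable regardless of how the nonzero values are distributed, which is the appeal over structured sparsity: no particular pruning pattern has to be imposed on the algorithm.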