Tradeoff Performance and Energy Efficiency by Optimizing the Data Flow for PIM Architectures

Impact factor: 2.9 · CAS Zone 3 (Computer Science) · JCR Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE)
Yunping Zhao;Sheng Ma;Yuhua Tang;Hengzhu Liu;Dongsheng Li
DOI: 10.1109/TCAD.2024.3522879
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 7, pp. 2530–2543
Publication date: 2024-12-25
Full text: https://ieeexplore.ieee.org/document/10816195/
Citations: 0

Abstract

The processing-in-memory (PIM) architecture has become a promising candidate for deep learning accelerators by integrating computation and memory. Most PIM-based studies improve performance and energy efficiency by using the weight-stationary (WS) data flow, owing to its high parallelism. However, the WS data flow has some fundamental limitations. First, it incurs heavy activation movement between on-chip and off-chip memory due to the limited capacity of the resistive random-access memory (ReRAM) array. Second, it must read the input activations repeatedly according to the convolution window. These data movements degrade the energy efficiency and performance of the PIM architecture. To address these issues, the input-stationary (IS) data flow stores activations instead of weights to reduce data movement. But the IS data flow faces its own challenges. First, the data dependency between adjacent layers limits performance. Second, its special mapping method incurs heavy across-array computation. Third, previous IS data flows cannot achieve high parallelism. Fourth, the IS data flow depends on a 3-D ReRAM structure. To address these issues, we propose a novel data flow for PIM architectures. We optimize the IS data flow to decrease activation movement and propose a parallel computing method that achieves high parallelism while reducing across-array computation. We identify and analyze the fundamental limitations and impact of different interlayer data flows, including WS-WS, IS-IS, WS-IS, and IS-WS. We also propose a method to build a hybrid data flow by combining these interlayer data flows to trade off performance and energy consumption. Our experimental results and analysis demonstrate the potential of our design. The performance and energy efficiency of our design reach 0.13–1.77 TFLOPS and 61–85 TOPS/J, respectively. Compared to the state-of-the-art design, NEBULA, our design improves performance by 1.4×, 2.3×, and 3.5× when deploying MobileNet-V1, ResNet-18, and VGG-16, and also improves energy efficiency by 3.3×, 2×, and 2×, respectively.
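The WS/IS distinction in the abstract comes down to which operand stays pinned in place while the other streams past it. A minimal illustrative sketch (not the paper's implementation, and ignoring the ReRAM array mapping entirely) can show the two loop orders for a 1-D convolution: in WS the outer loop holds a weight stationary, so the activation stream is re-read once per weight position; in IS the outer loop holds an activation stationary, so each input element is read only once.

```python
def conv1d_ws(acts, weights):
    """Weight-stationary order: each weight stays fixed while
    the activations stream past it (activations are re-read
    once per weight position)."""
    out = [0.0] * (len(acts) - len(weights) + 1)
    for k, w in enumerate(weights):       # weight k is "stationary"
        for i in range(len(out)):         # activations stream by
            out[i] += w * acts[i + k]
    return out

def conv1d_is(acts, weights):
    """Input-stationary order: each activation stays fixed while
    the weights stream past it (each input is read only once)."""
    out = [0.0] * (len(acts) - len(weights) + 1)
    for i, a in enumerate(acts):          # activation i is "stationary"
        for k, w in enumerate(weights):   # weights stream by
            j = i - k                     # output this pair contributes to
            if 0 <= j < len(out):
                out[j] += w * a
    return out

acts = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [1.0, 0.0, -1.0]
print(conv1d_ws(acts, weights))   # → [-2.0, -2.0, -2.0]
print(conv1d_is(acts, weights))   # → [-2.0, -2.0, -2.0] (same result)
```

Both orderings compute identical outputs; they differ only in which operand is reused from local storage, which is exactly the axis along which the paper trades off data movement against parallelism.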
Source journal
CiteScore: 5.60
Self-citation rate: 13.80%
Annual article count: 500
Review time: 7 months
Journal description: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.