Yunping Zhao;Sheng Ma;Yuhua Tang;Hengzhu Liu;Dongsheng Li
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 7, pp. 2530-2543. Publication date: 2024-12-25. DOI: 10.1109/TCAD.2024.3522879. URL: https://ieeexplore.ieee.org/document/10816195/
Tradeoff Performance and Energy Efficiency by Optimizing the Data Flow for PIM Architectures
The processing-in-memory (PIM) architecture has become a promising candidate for deep learning accelerators because it integrates computation and memory. Most PIM-based studies improve performance and energy efficiency by using the weight stationary (WS) data flow because of its high parallelism. However, the WS data flow has some fundamental limitations. First, it incurs heavy activation movement between on-chip and off-chip memory because of the limited memory space of the resistive random-access memory (ReRAM) array. Second, it must read the input activations repeatedly as the convolution window slides. These data movements decrease the energy efficiency and performance of the PIM architecture. To address these issues, the input stationary (IS) data flow stores activations instead of weights to reduce data movement. But the IS data flow faces its own challenges. First, the data dependency between adjacent layers limits performance. Second, the special mapping method leads to a large number of across-array computations. Third, the previous IS data flow cannot achieve high parallelism. Fourth, the IS data flow depends on the 3-D ReRAM structure. To address these issues, we propose a novel data flow for PIM architectures. We optimize the IS data flow to decrease activation movement and propose a parallel computing method to achieve high parallelism and reduce across-array computations. We identify and analyze the fundamental limitations and impact of different interlayer data flows, including WS-WS, IS-IS, WS-IS, and IS-WS. We also propose a method to build a hybrid data flow by combining these interlayer data flows to trade off performance and energy consumption. Our experimental results and analysis demonstrate the potential of our design. The performance and energy efficiency of our design reach 0.13–1.77 TFLOPS and 61–85 TOPS/J, respectively.
Compared to the state-of-the-art design, NEBULA, our design improves performance by $1.4\times$, $2.3\times$, and $3.5\times$ when deploying MobileNet-V1, ResNet-18, and VGG-16, and improves energy efficiency by $3.3\times$, $2\times$, and $2\times$, respectively.
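The repeated reading of input activations under a sliding convolution window can be illustrated with a small counting sketch. This is not the paper's cost model: the function name `conv_window_reads`, the unpadded single-channel layer, and the 224x224 input with a 3x3 kernel are all assumptions chosen for illustration. It only compares how many activation reads the windows request in total with how many distinct activations exist.

```python
def conv_window_reads(height, width, kernel, stride=1):
    """Count activation reads for one input channel of an unpadded convolution.

    Hypothetical helper for illustration only; it ignores buffering,
    padding, and any hardware-specific reuse.
    """
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    # Every output position reads a full kernel x kernel window.
    windowed_reads = out_h * out_w * kernel * kernel
    # Each activation exists exactly once in the input feature map.
    unique_activations = height * width
    return windowed_reads, unique_activations


if __name__ == "__main__":
    reads, unique = conv_window_reads(224, 224, 3)
    print(f"windowed reads: {reads}, unique activations: {unique}, "
          f"ratio: {reads / unique:.2f}")
```

Under these assumptions each activation is requested close to nine times on average for a 3x3 kernel with stride 1, which is the kind of redundancy that keeping activations stationary aims to avoid.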
About the journal:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.