{"title":"GPOS:一种基于OCP和NDP协同优化的DNN加速高通用性的通用精确卸载策略","authors":"Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong","doi":"10.1109/TCAD.2025.3555184","DOIUrl":null,"url":null,"abstract":"The arithmetic intensity (ArI) of different DNNs can be opposite. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP). Neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. It is relatively straightforward to think of combining the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and NDP-input stationary DF are proposed to reduce Ex-cost and extend operator types supported by NDP. Next, for the first time, sparsity—one of the most popular methods for energy optimization that can alter data reuse or ArI—is quantitatively investigated for its impacts on offloading using GPOS. 
Our evaluations include representative DNNs, including GPT-2, Bert, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to the evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged, even after sparsity is applied.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3776-3789"},"PeriodicalIF":2.9000,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing\",\"authors\":\"Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong\",\"doi\":\"10.1109/TCAD.2025.3555184\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The arithmetic intensity (ArI) of different DNNs can be opposite. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP). Neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. It is relatively straightforward to think of combining the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily due to the lack of a quantifiable offloading method. 
Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and NDP-input stationary DF are proposed to reduce Ex-cost and extend operator types supported by NDP. Next, for the first time, sparsity—one of the most popular methods for energy optimization that can alter data reuse or ArI—is quantitatively investigated for its impacts on offloading using GPOS. Our evaluations include representative DNNs, including GPT-2, Bert, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2% over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared to the evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. 
However, for DNNs where all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged, even after sparsity is applied.\",\"PeriodicalId\":13251,\"journal\":{\"name\":\"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems\",\"volume\":\"44 10\",\"pages\":\"3776-3789\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10939009/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10939009/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing
The arithmetic intensity (ArI) of different DNNs can lie at opposite extremes. This challenges the generality of single acceleration architectures, including both dedicated on-chip processing (OCP) and near-data processing (NDP): neither architecture can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. A natural idea is to combine the respective advantages of OCP and NDP. However, few publications have addressed their real-time co-optimization, primarily because a quantifiable offloading method has been lacking. Here, we propose GPOS, a general and precise offloading strategy that supports highly general DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), the DNN model, and interdie data movements (DMs). Three quantifiable indicators—ArI, execution cost (Ex-cost), and DM-cost—are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down the offloading candidates in steps 1 and 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and an NDP-input-stationary DF are proposed to reduce Ex-cost and extend the operator types supported by NDP. Finally, for the first time, sparsity—one of the most popular energy-optimization methods, which can alter data reuse and hence ArI—is quantitatively investigated for its impact on offloading using GPOS. Our evaluation covers representative DNNs, including GPT-2, BERT, RNN, CNN, and MLP models.
GPOS achieves the minimum energy and latency on every benchmark, with geometric-mean speedups of 49.0% and 94.1%, and geometric-mean energy savings of 45.8% and 89.2%, over All-OCP and All-NDP, respectively. GPOS also reduces offloading-analysis latency by a geometric mean of 92.7% compared to an exhaustive evaluation that traverses every operator and its related combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs whose operators all exhibit either very high or very low ArI, the number of offloaded operators remains unchanged even after sparsity is applied.
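The indicator-driven pruning the abstract describes (operators with low ArI are memory-bound and favor NDP; high-ArI operators favor OCP) can be illustrated with a minimal sketch. All names, the threshold value, and the cost numbers below are illustrative assumptions, not the paper's actual formulas or parameters:

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    flops: float        # total arithmetic operations
    bytes_moved: float  # operand + result traffic in bytes

    @property
    def arithmetic_intensity(self) -> float:
        # ArI = compute performed per byte of data movement
        return self.flops / self.bytes_moved

def offload_candidates(ops, ari_threshold=4.0):
    """Step-1-style pruning: low-ArI (memory-bound) operators become
    NDP offloading candidates; high-ArI (compute-bound) operators stay
    on OCP. The threshold is a hypothetical, hardware-dependent value;
    GPOS refines such candidates further with Ex-cost and DM-cost."""
    ndp, ocp = [], []
    for op in ops:
        (ndp if op.arithmetic_intensity < ari_threshold else ocp).append(op)
    return ndp, ocp

# Example: a GEMM-like compute-bound op vs. an elementwise memory-bound op
gemm = Operator("matmul", flops=2e9, bytes_moved=1.2e7)  # ArI ~ 167
gelu = Operator("gelu", flops=8e6, bytes_moved=1.6e7)    # ArI = 0.5
ndp, ocp = offload_candidates([gemm, gelu])
```

This kind of cheap per-operator filter is what makes narrowing the candidate set fast enough for real-time analysis: the expensive quantitative cost evaluation only runs on the survivors.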
Journal introduction:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.