GPOS: A General and Precise Offloading Strategy for High Generality of DNN Acceleration by OCP and NDP Co-Optimizing

IF 2.9 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture)

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems · Pub Date: 2025-03-26 · DOI: 10.1109/TCAD.2025.3555184

Zixu Li;Wang Wang;Manni Li;Jiayu Yang;Zijian Huang;Xin Zhong;Yinyin Lin;Chengchen Wang;Xiankui Xiong

Vol. 44, No. 10, pp. 3776-3789 · https://ieeexplore.ieee.org/document/10939009/

Citations: 0

Abstract

The arithmetic intensity (ArI) of different DNNs can sit at opposite extremes. This challenges the generality of any single acceleration architecture, whether dedicated on-chip processing (OCP) or near-data processing (NDP): neither can simultaneously achieve optimal energy efficiency and performance for operators with opposite ArI. Combining the respective advantages of OCP and NDP is a natural idea, yet few publications have addressed their real-time co-optimization, primarily because a quantifiable offloading method has been lacking. Here, we propose GPOS, a general and precise offloading strategy that supports high generality of DNN acceleration. GPOS comprehensively considers the complex interactions between OCP and NDP, including hardware configurations, dataflow (DF), DNN model, and interdie data movements (DMs). Three quantifiable indicators, namely ArI, execution cost (Ex-cost), and DM-cost, are employed to precisely evaluate the impacts of these interactions on energy and latency. GPOS adopts a four-step flow with progressive refinement: each of the first three steps focuses on a single indicator at the operator level, while the final step performs context-based calibration to address operator interdependencies and avoid offsetting NDP benefits. Narrowing down offloading candidates in step 1 and step 3 significantly accelerates real-time quantitative analysis. Optimized mapping techniques and an NDP-input stationary DF are proposed to reduce Ex-cost and extend the operator types supported by NDP. In addition, for the first time, the impact of sparsity on offloading is quantitatively investigated using GPOS; sparsity is one of the most popular methods for energy optimization and can alter data reuse or ArI.

Our evaluations cover representative DNNs: GPT-2, BERT, RNN, CNN, and MLP. GPOS achieves the minimum energy and latency for each benchmark, with geometric mean speedups of 49.0% and 94.1%, and geometric mean energy savings of 45.8% and 89.2%, over All-OCP and All-NDP, respectively. GPOS also reduces offloading analysis latency by a geometric mean of 92.7% compared with an evaluation that traverses each operator and its relative combinations. On average, sparsity further improves performance and energy efficiency by increasing the number of operators offloaded to NDP. However, for DNNs in which all operators exhibit either very high or very low ArI, the number of offloaded operators remains unchanged even after sparsity is applied.
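To make the notion of "opposite ArI" concrete, the following is a minimal sketch, not the authors' implementation. It computes arithmetic intensity for two dense matrix-multiply operators and keeps the low-ArI one as an NDP offloading candidate, loosely in the spirit of the first narrowing step. The operator shapes, the 16-bit operand width, the assumption that each tensor crosses the memory interface exactly once, and the `ridge_point` threshold are all illustrative assumptions; Ex-cost, DM-cost, and the context-based calibration described in the abstract are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatmulOp:
    """A dense (m x k) @ (k x n) operator; names and shapes are illustrative."""
    name: str
    m: int
    k: int
    n: int
    bytes_per_elem: int = 2  # assumed 16-bit operands


def arithmetic_intensity(op: MatmulOp) -> float:
    """FLOPs per byte moved, assuming each tensor is transferred exactly once."""
    flops = 2 * op.m * op.k * op.n  # one multiply and one add per MAC
    elems = op.m * op.k + op.k * op.n + op.m * op.n  # inputs, weights, outputs
    return flops / (elems * op.bytes_per_elem)


def ndp_candidates(ops, ridge_point):
    """Keep operators whose ArI is below a hypothetical threshold (FLOPs/byte)."""
    return [op.name for op in ops if arithmetic_intensity(op) < ridge_point]


ops = [
    MatmulOp("conv_as_gemm", m=1024, k=1024, n=1024),  # high data reuse
    MatmulOp("fc_batch1", m=1, k=4096, n=4096),        # matrix-vector, little reuse
]

for op in ops:
    print(f"{op.name}: ArI = {arithmetic_intensity(op):.2f} FLOPs/byte")

print("NDP candidates:", ndp_candidates(ops, ridge_point=10.0))
```

Under these assumptions the square matrix multiply reaches roughly 341 FLOPs/byte while the batch-1 fully connected layer stays just under 1 FLOP/byte, so only the latter passes the threshold. A real flow would derive the threshold from the hardware configuration rather than fix it by hand.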




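Source journal

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 5.60

Self-citation rate: 13.80%

Publication volume: 500

Review time: 7 months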

Journal description: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical, and logical design, including planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design, and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.