Identifying Optimal Workload Offloading Partitions for CPU-PIM Graph Processing Accelerators

IF 2.8 | CAS Category 2 (Engineering & Technology) | JCR Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Sheng Xu;Chun Li;Le Luo;Wu Zhou;Liang Yan;Xiaoming Chen
{"title":"识别CPU-PIM图形处理加速器的最佳工作负载卸载分区","authors":"Sheng Xu;Chun Li;Le Luo;Wu Zhou;Liang Yan;Xiaoming Chen","doi":"10.1109/TVLSI.2025.3526201","DOIUrl":null,"url":null,"abstract":"The integrated architecture that features both in-memory logic and host processors, or so-called “processing-in-memory” (PIM) architecture, is an emerging and promising solution to bridge the performance gap between the memory and host processors. In spite of the considerable potential of PIM, the workload offloading policy, which partitions the program and determines where code snippets are executed, is still a main challenge in PIM. In order to determine the best PIM offloading partitions, existing methods require in-depth program profiling to create the control flow graph (CFG) and then transform it into a graph-cut problem. These CFG-based solutions depend on detailed profiling of a crucial element, the execution time of basic blocks, to accurately assess the benefits of PIM offloading. The issue is that these execution times can change significantly in PIM, leading to inaccurate offloading decisions. To tackle this challenge, we present a novel PIM workload offloading framework called “RDPIM” for CPU-PIM graph processing accelerators, which systematically considers the variations in the execution time of basic blocks. By analyzing the relationship between data dependencies among workloads and the connectivity of input graphs, we identified three key features that can lead to variations in execution time. We developed a novel reuse distance (RD)-based model to predict the exact performance of basic blocks for optimal offloading decisions. We evaluate RDPIM using real-world graphs and compare it with some state-of-the-art PIM offloading approaches. Experiments have demonstrated that our method achieves an average speedup of <inline-formula> <tex-math>$2\\times $ </tex-math></inline-formula> compared to CPU-only executions and up to <inline-formula> <tex-math>$1.6\\times $ </tex-math></inline-formula> compared to state-of-the-art PIM offloading schemes.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 4","pages":"1053-1064"},"PeriodicalIF":2.8000,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Identifying Optimal Workload Offloading Partitions for CPU-PIM Graph Processing Accelerators\",\"authors\":\"Sheng Xu;Chun Li;Le Luo;Wu Zhou;Liang Yan;Xiaoming Chen\",\"doi\":\"10.1109/TVLSI.2025.3526201\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The integrated architecture that features both in-memory logic and host processors, or so-called “processing-in-memory” (PIM) architecture, is an emerging and promising solution to bridge the performance gap between the memory and host processors. In spite of the considerable potential of PIM, the workload offloading policy, which partitions the program and determines where code snippets are executed, is still a main challenge in PIM. In order to determine the best PIM offloading partitions, existing methods require in-depth program profiling to create the control flow graph (CFG) and then transform it into a graph-cut problem. These CFG-based solutions depend on detailed profiling of a crucial element, the execution time of basic blocks, to accurately assess the benefits of PIM offloading. 
The issue is that these execution times can change significantly in PIM, leading to inaccurate offloading decisions. To tackle this challenge, we present a novel PIM workload offloading framework called “RDPIM” for CPU-PIM graph processing accelerators, which systematically considers the variations in the execution time of basic blocks. By analyzing the relationship between data dependencies among workloads and the connectivity of input graphs, we identified three key features that can lead to variations in execution time. We developed a novel reuse distance (RD)-based model to predict the exact performance of basic blocks for optimal offloading decisions. We evaluate RDPIM using real-world graphs and compare it with some state-of-the-art PIM offloading approaches. Experiments have demonstrated that our method achieves an average speedup of <inline-formula> <tex-math>$2\\\\times $ </tex-math></inline-formula> compared to CPU-only executions and up to <inline-formula> <tex-math>$1.6\\\\times $ </tex-math></inline-formula> compared to state-of-the-art PIM offloading schemes.\",\"PeriodicalId\":13425,\"journal\":{\"name\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"volume\":\"33 4\",\"pages\":\"1053-1064\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Very Large Scale Integration (VLSI) Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10843132/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10843132/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

The integrated architecture that features both in-memory logic and host processors, or so-called “processing-in-memory” (PIM) architecture, is an emerging and promising solution to bridge the performance gap between the memory and host processors. In spite of the considerable potential of PIM, the workload offloading policy, which partitions the program and determines where code snippets are executed, is still a main challenge in PIM. In order to determine the best PIM offloading partitions, existing methods require in-depth program profiling to create the control flow graph (CFG) and then transform it into a graph-cut problem. These CFG-based solutions depend on detailed profiling of a crucial element, the execution time of basic blocks, to accurately assess the benefits of PIM offloading. The issue is that these execution times can change significantly in PIM, leading to inaccurate offloading decisions. To tackle this challenge, we present a novel PIM workload offloading framework called “RDPIM” for CPU-PIM graph processing accelerators, which systematically considers the variations in the execution time of basic blocks. By analyzing the relationship between data dependencies among workloads and the connectivity of input graphs, we identified three key features that can lead to variations in execution time. We developed a novel reuse distance (RD)-based model to predict the exact performance of basic blocks for optimal offloading decisions. We evaluate RDPIM using real-world graphs and compare it with some state-of-the-art PIM offloading approaches. Experiments have demonstrated that our method achieves an average speedup of $2\times $ compared to CPU-only executions and up to $1.6\times $ compared to state-of-the-art PIM offloading schemes.
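The paper's RD-based performance model is not reproduced here, but the underlying notion of reuse distance is standard: for each memory access, count the number of distinct addresses touched since the previous access to the same address. The sketch below is a minimal Python illustration of that general concept on a hypothetical vertex-access trace; it is not the RDPIM model itself. Intuitively, basic blocks whose accesses show long or cold reuse distances cache poorly on the host CPU and are the kind of candidates that tend to benefit from PIM offloading.

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Return the LRU reuse (stack) distance of every access in `trace`.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address; first-time
    accesses yield None (commonly treated as an infinite distance).
    """
    stack = OrderedDict()            # keys ordered from least to most recently used
    distances = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # Distinct addresses seen after the last use of `addr`.
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            distances.append(None)   # cold access: no prior reuse
            stack[addr] = True
    return distances

# Hypothetical toy trace of vertex-data accesses during a graph traversal.
trace = ["v0", "v1", "v2", "v0", "v3", "v1"]
print(reuse_distances(trace))        # [None, None, None, 2, None, 3]
```

Production profilers typically use tree-based counting rather than the quadratic list scan shown here; the sketch is only meant to make the reuse-distance metric concrete.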
Source journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CiteScore: 6.40
Self-citation rate: 7.10%
Articles published: 187
Review time: 3.6 months
Journal introduction: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chips and wafer fabrication, packaging, testing and systems applications. Generation of specifications, design and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems have been founded. The editorial board, consisting of international experts, invites original papers which emphasize and merit the novel systems integration aspects of microelectronic systems including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chips and wafer fabrication, testing and packaging, and systems level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.