{"title":"DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture","authors":"Minjia Zhang, Zehua Hu, Mingqin Li","doi":"10.1109/IPDPS49936.2021.00024","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNNs) are currently the foundation for many artificial intelligence tasks. Existing DL frameworks and compilers often focus on optimizing DL inference speed against CPUs and GPUs in isolation while missing the opportunities to reap the benefits of aggregated computation power from both CPU and GPU. We show that there are DNNs that exhibit complex computation patterns, and different components might be suitable for executing on different types of devices to maximize performance gains. Based on this observation, we present a DNN inference engine, called DUET, that explores potential concurrent execution opportunities on heterogeneous CPU-GPU architecture for DNN inference. In particular, we introduce (i) a coarse-grained partitioning strategy that divides a DNN computation graph into subgraphs that retain high computational granularity with relatively low communication volume, (ii) a compiler-aware profiling method to include DL compiler optimization into the loop to improve scheduling decisions, and (iii) a greedy-correction subgraph scheduling algorithm that automatically maps the DNN computation to CPU and GPU without input from model developers. We evaluate DUET against several DNNs that exhibit complex model structures and compare its performance against existing DL frameworks and the state-of-the-art DNN compiler. The experiment results show that DUET is much faster than existing DL frameworks and obtains 1.5–2.3 times and 1.3–6.4 times speed-ups against the optimized code by the state-of-the-art DNN compiler on GPU and CPU alone, respectively.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"121 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Deep neural networks (DNNs) are currently the foundation for many artificial intelligence tasks. Existing DL frameworks and compilers often focus on optimizing DNN inference speed for the CPU or the GPU in isolation, missing the opportunity to exploit the aggregated computation power of both. We show that some DNNs exhibit complex computation patterns in which different components are best executed on different types of devices to maximize performance. Based on this observation, we present a DNN inference engine, called DUET, that explores concurrent execution opportunities for DNN inference on a heterogeneous CPU-GPU architecture. In particular, we introduce (i) a coarse-grained partitioning strategy that divides a DNN computation graph into subgraphs with high computational granularity and relatively low communication volume, (ii) a compiler-aware profiling method that brings DL compiler optimizations into the scheduling loop to improve placement decisions, and (iii) a greedy-correction subgraph scheduling algorithm that automatically maps the DNN computation onto the CPU and GPU without input from model developers. We evaluate DUET on several DNNs with complex model structures and compare its performance against existing DL frameworks and a state-of-the-art DNN compiler. The experimental results show that DUET is substantially faster than existing DL frameworks and achieves 1.5–2.3x and 1.3–6.4x speed-ups over code optimized by the state-of-the-art DNN compiler for the GPU alone and the CPU alone, respectively.
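The abstract does not spell out the scheduling algorithm itself, so the following is a minimal, illustrative Python sketch of the kind of greedy-plus-correction placement it describes. It assumes per-subgraph latencies (`cpu_ms`, `gpu_ms`) come from the compiler-aware profiling step; all names (`Subgraph`, `greedy_schedule`, `correction_pass`) are hypothetical, and the sketch ignores inter-subgraph dependencies and CPU-GPU transfer costs that the real system must account for.

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    name: str
    cpu_ms: float  # profiled latency of compiler-optimized code on CPU (assumed input)
    gpu_ms: float  # profiled latency of compiler-optimized code on GPU (assumed input)

def makespan(subgraphs, placement):
    """End-to-end time when CPU and GPU run their assigned subgraphs concurrently."""
    cpu = sum(sg.cpu_ms for sg in subgraphs if placement[sg.name] == "cpu")
    gpu = sum(sg.gpu_ms for sg in subgraphs if placement[sg.name] == "gpu")
    return max(cpu, gpu)

def greedy_schedule(subgraphs):
    """Greedy phase: place each subgraph on whichever device keeps the
    running makespan (max of per-device finish times) lower."""
    cpu_time = gpu_time = 0.0
    placement = {}
    for sg in subgraphs:
        if max(cpu_time + sg.cpu_ms, gpu_time) <= max(cpu_time, gpu_time + sg.gpu_ms):
            placement[sg.name] = "cpu"
            cpu_time += sg.cpu_ms
        else:
            placement[sg.name] = "gpu"
            gpu_time += sg.gpu_ms
    return placement

def correction_pass(subgraphs, placement):
    """Correction phase: flip a subgraph to the other device whenever the
    flip shrinks the makespan (a simplified stand-in for DUET's correction step)."""
    best = makespan(subgraphs, placement)
    for sg in subgraphs:
        trial = dict(placement)
        trial[sg.name] = "gpu" if placement[sg.name] == "cpu" else "cpu"
        if makespan(subgraphs, trial) < best:
            placement, best = trial, makespan(subgraphs, trial)
    return placement, best

if __name__ == "__main__":
    # Hypothetical profiled subgraphs: a lookup-heavy part that favors the CPU,
    # a dense convolutional block that favors the GPU, and light post-processing.
    sgs = [Subgraph("embedding", cpu_ms=1.2, gpu_ms=2.0),
           Subgraph("conv_block", cpu_ms=9.5, gpu_ms=1.8),
           Subgraph("post_proc", cpu_ms=0.8, gpu_ms=1.1)]
    placement = greedy_schedule(sgs)
    placement, ms = correction_pass(sgs, placement)
    print(placement, f"makespan = {ms:.1f} ms")
```

On the example inputs, the greedy phase keeps the embedding and post-processing subgraphs on the CPU while the convolutional block runs concurrently on the GPU, and the correction pass confirms no single flip improves the 2.0 ms makespan. This greedy-then-correct structure mirrors the paper's stated goal of automatic placement without developer input, though the actual algorithm and cost model are only summarized in the abstract.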