{"title":"DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture","authors":"Minjia Zhang, Zehua Hu, Mingqin Li","doi":"10.1109/IPDPS49936.2021.00024","DOIUrl":null,"url":null,"abstract":"Deep neural networks (DNNs) are currently the foundation for many artificial intelligence tasks. Existing DL frameworks and compilers often focus on optimizing DL inference speed against CPUs and GPUs in isolation while missing the opportunities to reap the benefits of aggregated computation power from both CPU and GPU. We show that there are DNNs that exhibit complex computation patterns, and different components might be suitable for executing on different types of devices to maximize performance gains. Based on this observation, we present a DNN inference engine, called DUET, that explores potential concurrent execution opportunities on heterogeneous CPU-GPU architecture for DNN inference. In particular, we introduce (i) a coarse-grained partitioning strategy that divides a DNN computation graph into subgraphs that retain high computational granularity with relatively low communication volume, (ii) a compiler-aware profiling method to include DL compiler optimization into the loop to improve scheduling decisions, and (iii) a greedy-correction subgraph scheduling algorithm that automatically maps the DNN computation to CPU and GPU without input from model developers. We evaluate DUET against several DNNs that exhibit complex model structures and compare its performance against existing DL frameworks and the state-of-the-art DNN compiler. The experiment results show that DUET is much faster than existing DL frameworks and obtains 1.5–2.3 times and 1.3–6.4 times speed-ups against the optimized code by the state-of-the-art DNN compiler on GPU and CPU alone, respectively.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"121 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS49936.2021.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Deep neural networks (DNNs) are currently the foundation for many artificial intelligence tasks. Existing DL frameworks and compilers often focus on optimizing DNN inference speed for the CPU or the GPU in isolation, missing the opportunity to exploit the aggregated computation power of both. We show that some DNNs exhibit complex computation patterns in which different components are best executed on different types of devices to maximize performance. Based on this observation, we present a DNN inference engine, called DUET, that explores concurrent execution opportunities for DNN inference on a heterogeneous CPU-GPU architecture. In particular, we introduce (i) a coarse-grained partitioning strategy that divides a DNN computation graph into subgraphs with high computational granularity and relatively low communication volume, (ii) a compiler-aware profiling method that brings DL compiler optimizations into the scheduling loop to improve placement decisions, and (iii) a greedy-correction subgraph scheduling algorithm that automatically maps the DNN computation onto the CPU and GPU without input from model developers. We evaluate DUET on several DNNs with complex model structures and compare its performance against existing DL frameworks and a state-of-the-art DNN compiler. The experimental results show that DUET is substantially faster than existing DL frameworks and achieves 1.5–2.3x and 1.3–6.4x speed-ups over code optimized by the state-of-the-art DNN compiler for the GPU alone and the CPU alone, respectively.
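The abstract does not spell out the scheduling algorithm itself, so the following is a minimal, illustrative Python sketch of the kind of greedy-plus-correction placement it describes. It assumes per-subgraph latencies (`cpu_ms`, `gpu_ms`) come from the compiler-aware profiling step; all names (`Subgraph`, `greedy_schedule`, `correction_pass`) are hypothetical, and the sketch ignores inter-subgraph dependencies and CPU-GPU transfer costs that the real system must account for.

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    name: str
    cpu_ms: float  # profiled latency of compiler-optimized code on CPU (assumed input)
    gpu_ms: float  # profiled latency of compiler-optimized code on GPU (assumed input)

def makespan(subgraphs, placement):
    """End-to-end time when CPU and GPU run their assigned subgraphs concurrently."""
    cpu = sum(sg.cpu_ms for sg in subgraphs if placement[sg.name] == "cpu")
    gpu = sum(sg.gpu_ms for sg in subgraphs if placement[sg.name] == "gpu")
    return max(cpu, gpu)

def greedy_schedule(subgraphs):
    """Greedy phase: place each subgraph on whichever device keeps the
    running makespan (max of per-device finish times) lower."""
    cpu_time = gpu_time = 0.0
    placement = {}
    for sg in subgraphs:
        if max(cpu_time + sg.cpu_ms, gpu_time) <= max(cpu_time, gpu_time + sg.gpu_ms):
            placement[sg.name] = "cpu"
            cpu_time += sg.cpu_ms
        else:
            placement[sg.name] = "gpu"
            gpu_time += sg.gpu_ms
    return placement

def correction_pass(subgraphs, placement):
    """Correction phase: flip a subgraph to the other device whenever the
    flip shrinks the makespan (a simplified stand-in for DUET's correction step)."""
    best = makespan(subgraphs, placement)
    for sg in subgraphs:
        trial = dict(placement)
        trial[sg.name] = "gpu" if placement[sg.name] == "cpu" else "cpu"
        if makespan(subgraphs, trial) < best:
            placement, best = trial, makespan(subgraphs, trial)
    return placement, best

if __name__ == "__main__":
    # Hypothetical profiled subgraphs: a lookup-heavy part that favors the CPU,
    # a dense convolutional block that favors the GPU, and light post-processing.
    sgs = [Subgraph("embedding", cpu_ms=1.2, gpu_ms=2.0),
           Subgraph("conv_block", cpu_ms=9.5, gpu_ms=1.8),
           Subgraph("post_proc", cpu_ms=0.8, gpu_ms=1.1)]
    placement = greedy_schedule(sgs)
    placement, ms = correction_pass(sgs, placement)
    print(placement, f"makespan = {ms:.1f} ms")
```

On the example inputs, the greedy phase keeps the embedding and post-processing subgraphs on the CPU while the convolutional block runs concurrently on the GPU, and the correction pass confirms no single flip improves the 2.0 ms makespan. This greedy-then-correct structure mirrors the paper's stated goal of automatic placement without developer input, though the actual algorithm and cost model are only summarized in the abstract.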