Pipelined Data-Parallel CPU/GPU Scheduling for Multi-DNN Real-Time Inference
Authors: Yecheng Xiang, Hyoseung Kim
Venue: 2019 IEEE Real-Time Systems Symposium (RTSS)
Published: 2019-12-01
DOI: https://doi.org/10.1109/RTSS46320.2019.00042
Citations: 90
Abstract
Deep neural networks (DNNs) have been showing significant success in various applications, such as autonomous driving, mobile devices, and Internet of Things. Although much research has been conducted to optimize the structure of DNNs, limited attention has been given to their timely execution, specifically on the scheduling of real-time inference requests to various DNN models. For instance, existing DNN frameworks, such as Caffe, TensorFlow and Torch, only provide a single-level priority, one-DNN-per-process execution model and sequential inference interfaces. They can be particularly problematic when used in edge computing and in-vehicle intelligence systems for multiple DNNs, as response time may become unpredictably long in the worst case while leaving system resources underutilized. This paper presents DART, a DNN scheduling framework that offers deterministic response time to real-time tasks and increased throughput to best-effort tasks. DART employs a pipeline-based scheduling architecture with data parallelism, where heterogeneous CPUs and GPUs are arranged into nodes with different parallelism levels. DART also includes pipeline stage design and node configuration schemes, admission control, execution time profiling, and runtime enforcement techniques. We evaluated DART on Intel x86 Xeon and Nvidia ARM platforms with GPUs. Experimental results indicate that DART significantly outperforms the existing approaches, by up to 98.5% shorter worst-case response time for real-time tasks while simultaneously achieving up to 17.9% higher throughput for best-effort tasks.
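The abstract describes a pipeline whose stages are served by nodes with different parallelism levels. The sketch below is not DART's algorithm or analysis. It is only a textbook back-of-the-envelope model of such a pipeline, included to illustrate why per-stage parallelism matters. The stage names, execution times, and worker counts are hypothetical, and the model assumes each worker handles one request at a time with no queueing or contention overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline stage served by a node of identical workers (hypothetical model)."""
    name: str
    exec_time_ms: float  # time one worker needs to process one request
    workers: int         # data-parallel workers in the node serving this stage


def unloaded_latency_ms(stages):
    """Latency of a single request through an otherwise idle pipeline."""
    return sum(s.exec_time_ms for s in stages)


def stage_rate_per_s(stage):
    """Requests per second one stage can sustain with all workers busy."""
    return 1000.0 * stage.workers / stage.exec_time_ms


def max_throughput_per_s(stages):
    """Steady-state throughput is capped by the slowest stage."""
    return min(stage_rate_per_s(s) for s in stages)


def bottleneck(stages):
    """Stage that limits steady-state throughput."""
    return min(stages, key=stage_rate_per_s)


# Hypothetical three-stage split of one DNN; all numbers are made up for illustration.
PIPELINE = [
    Stage("cpu_pre", exec_time_ms=4.0, workers=2),
    Stage("gpu_conv", exec_time_ms=10.0, workers=1),
    Stage("cpu_fc", exec_time_ms=6.0, workers=3),
]

if __name__ == "__main__":
    print("unloaded latency (ms):", unloaded_latency_ms(PIPELINE))
    print("max throughput (req/s):", max_throughput_per_s(PIPELINE))
    print("bottleneck stage:", bottleneck(PIPELINE).name)
```

In this toy model, adding a worker to the bottleneck stage raises throughput without changing the unloaded latency, which is the intuition behind giving different nodes different parallelism levels. Worst-case response time under load needs a proper schedulability analysis, which this sketch does not attempt.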