Enabling Real-time AI Inference on Mobile Devices via GPU-CPU Collaborative Execution

IF 0.5 · Q4 · Computer Science, Software Engineering
Hao Li, J. Ng, T. Abdelzaher
{"title":"Enabling Real-time AI Inference on Mobile Devices via GPU-CPU Collaborative Execution","authors":"Hao Li, J. Ng, T. Abdelzaher","doi":"10.1109/RTCSA55878.2022.00027","DOIUrl":null,"url":null,"abstract":"AI-powered mobile applications are becoming increasingly popular due to recent advances in machine intelligence. They include, but are not limited to mobile sensing, virtual assistants, and augmented reality. Mobile AI models, especially Deep Neural Networks (DNN), are usually executed locally, as sensory data are collected and generated by end devices. This imposes a heavy computational burden on the resource-constrained mobile phones. There are usually a set of DNN jobs with deadline constraints waiting for execution. Existing AI inference frameworks process incoming DNN jobs in sequential order, which does not optimally support mobile users’ real-time interactions with AI services. In this paper, we propose a framework to achieve real-time inference by exploring the heterogeneous mobile SoCs, which contain a CPU and a GPU. Considering characteristics of DNN models, we optimally partition the execution between the mobile GPU and CPU. We present a dynamic programming-based approach to solve the formulated real-time DNN partitioning and scheduling problem. The proposed framework has several desirable properties: 1) computational resources on mobile devices are better utilized; 2) it optimizes inference performance in terms of deadline miss rate; 3) no sacrifices in inference accuracy are made. Evaluation results on an off-the-shelf mobile phone show that our proposed framework can provide better real-time support for AI inference tasks on mobile platforms, compared to several baselines.","PeriodicalId":38446,"journal":{"name":"International Journal of Embedded and Real-Time Communication Systems (IJERTCS)","volume":"29 1","pages":"195-204"},"PeriodicalIF":0.5000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Embedded and Real-Time Communication Systems (IJERTCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/RTCSA55878.2022.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 2

Abstract

AI-powered mobile applications are becoming increasingly popular due to recent advances in machine intelligence. They include, but are not limited to, mobile sensing, virtual assistants, and augmented reality. Mobile AI models, especially Deep Neural Networks (DNNs), are usually executed locally, as sensory data are collected and generated by end devices. This imposes a heavy computational burden on resource-constrained mobile phones. Typically, a set of DNN jobs with deadline constraints is waiting for execution. Existing AI inference frameworks process incoming DNN jobs in sequential order, which does not optimally support mobile users' real-time interactions with AI services. In this paper, we propose a framework to achieve real-time inference by exploiting heterogeneous mobile SoCs, which contain both a CPU and a GPU. Considering the characteristics of DNN models, we optimally partition the execution between the mobile GPU and CPU. We present a dynamic programming-based approach to solve the formulated real-time DNN partitioning and scheduling problem. The proposed framework has several desirable properties: 1) computational resources on mobile devices are better utilized; 2) it optimizes inference performance in terms of deadline miss rate; 3) no sacrifices in inference accuracy are made. Evaluation results on an off-the-shelf mobile phone show that our proposed framework can provide better real-time support for AI inference tasks on mobile platforms, compared to several baselines.
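The abstract only sketches the approach, so the following is a minimal, hypothetical Python illustration of the GPU-CPU partitioning idea rather than the paper's actual algorithm. It assumes each DNN job is a linear chain of layers with profiled per-layer GPU and CPU latencies, splits each job once so that a prefix of layers runs on the GPU and the remaining suffix on the CPU, and walks the job queue in earliest-deadline-first order. The paper itself formulates a dynamic program; the Job fields, split rule, and numbers below are invented for illustration.

```python
# Hypothetical sketch of deadline-aware GPU/CPU partitioning for a queue of DNN jobs.
# Assumptions (not from the paper): linear layer chains, profiled per-layer latencies,
# and a single split point per job (prefix on GPU, suffix on CPU).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Job:
    gpu_lat: List[float]   # profiled per-layer latency on the GPU (ms)
    cpu_lat: List[float]   # profiled per-layer latency on the CPU (ms)
    deadline: float        # absolute deadline (ms)

def best_split(job: Job, gpu_free: float, cpu_free: float) -> Tuple[int, float, Tuple[float, float]]:
    """Try every split point k (layers [0, k) on GPU, [k, n) on CPU) and return the
    split that finishes earliest, plus the resulting GPU and CPU busy times."""
    n = len(job.gpu_lat)
    best_k, best_finish, best_state = 0, float("inf"), (gpu_free, cpu_free)
    for k in range(n + 1):
        if k == 0:                                        # whole job on the CPU
            gpu_done, cpu_done = gpu_free, cpu_free + sum(job.cpu_lat)
            finish = cpu_done
        elif k == n:                                      # whole job on the GPU
            gpu_done, cpu_done = gpu_free + sum(job.gpu_lat), cpu_free
            finish = gpu_done
        else:                                             # split across both processors
            gpu_done = gpu_free + sum(job.gpu_lat[:k])
            # CPU suffix waits for both a free CPU and the GPU prefix's output.
            cpu_done = max(cpu_free, gpu_done) + sum(job.cpu_lat[k:])
            finish = cpu_done
        if finish < best_finish:
            best_k, best_finish, best_state = k, finish, (gpu_done, cpu_done)
    return best_k, best_finish, best_state

def schedule(jobs: List[Job]) -> int:
    """Earliest-deadline-first pass over the job queue; returns the deadline-miss count.
    The paper uses a dynamic-programming formulation; this greedy loop only
    illustrates the per-job split decision, not the optimal schedule."""
    misses, gpu_free, cpu_free = 0, 0.0, 0.0
    for job in sorted(jobs, key=lambda j: j.deadline):
        _, finish, (gpu_free, cpu_free) = best_split(job, gpu_free, cpu_free)
        if finish > job.deadline:
            misses += 1
    return misses

if __name__ == "__main__":
    jobs = [
        Job(gpu_lat=[2.0, 3.0, 4.0], cpu_lat=[5.0, 6.0, 3.0], deadline=15.0),
        Job(gpu_lat=[1.0, 2.0, 2.0], cpu_lat=[2.0, 4.0, 4.0], deadline=12.0),
    ]
    print("deadline misses:", schedule(jobs))
```

A real implementation would replace the summed per-layer latencies with measured kernel times and account for CPU-GPU data-transfer costs at the split point, which is where the deadline-miss-rate optimization described in the abstract becomes non-trivial.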