A Task Parallelism Runtime Solution for Deep Learning Applications using MPSoC on Edge Devices
Hua Jiang, Raghav Chakravarthy, Ravikumar V. Chakaravarthy
2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), January 17, 2022. DOI: 10.1109/asp-dac52403.2022.9712581
Citations: 1
Abstract
AI on edge devices [1]–[4] has become increasingly popular over the last few years. Many research projects, such as TVM [5] and TensorFlow Lite [6], have focused on the deployment and acceleration of AI/ML models on edge devices. These solutions have predominantly used data parallelism to accelerate AI/ML models on the edge device, applying techniques such as operator fusion, nested parallelism, and memory latency hiding [5] to achieve the best performance on the supported hardware backends. However, when the hardware offers multiple heterogeneous backends, it becomes important to support task parallelism in addition to data parallelism to achieve optimal performance. Task-level parallelism [7], [8] breaks an AI/ML model down into multiple tasks that can be scheduled across the various heterogeneous backends available in a multi-processor system on chip (MPSoC). In our proposed solution, we take an AI/ML compute graph and break it into a directed acyclic graph (DAG) such that each node of the DAG represents a sub-graph of the original compute graph. The nodes of the DAG are generated using an auto-tuner to achieve optimal performance on the corresponding hardware backend, and each node is compiled into a binary executable for its targeted backend. We are extending our machine learning framework, XTA [9], to generate the DAG. The XTA runtime analyzes the DAG and generates a scheduling configuration: the nodes of the DAG are analyzed for dependencies and parallelized or pipelined accordingly. By parallelizing the execution of nodes in the DAG, we observe a 30% improvement over current solutions. Performance can be further optimized by using more hardware backend cores of the MPSoC to execute the nodes of the DAG in parallel, a capability missing in existing solutions.
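The scheduling idea in the abstract (analyze DAG nodes for dependencies, then run independent nodes in parallel) can be illustrated with a minimal sketch. This is not the authors' XTA runtime; the `dag` structure, `dependency_levels` helper, and thread-pool executor here are hypothetical stand-ins for the paper's dependency analysis and heterogeneous-backend dispatch.

```python
# Hypothetical sketch: partition a DAG of sub-graph tasks into dependency
# levels, then execute each level's independent nodes in parallel workers
# (standing in for the MPSoC's heterogeneous backend cores).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def dependency_levels(dag):
    """dag: {node: [predecessors]} -> list of levels; nodes in a level
    have no mutual dependencies and can run in parallel."""
    indeg = {n: len(preds) for n, preds in dag.items()}
    succs = defaultdict(list)
    for n, preds in dag.items():
        for p in preds:
            succs[p].append(n)
    level = [n for n, d in indeg.items() if d == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for n in level:
            for s in succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    nxt.append(s)
        level = nxt
    return levels


def run(dag, execute, workers=4):
    """Run each node only after its predecessors; each level's nodes are
    submitted to the pool concurrently."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in dependency_levels(dag):
            futures = {n: pool.submit(execute, n) for n in level}
            for n, f in futures.items():
                results[n] = f.result()
    return results


# Example: A feeds B and C (independent, so parallelizable); both feed D.
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(dependency_levels(dag))  # [['A'], ['B', 'C'], ['D']]
```

In a real MPSoC deployment, `execute` would dispatch each node's compiled binary to its tuned backend rather than a host thread, but the level-by-level dependency structure is the same.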