A Task Parallelism Runtime Solution for Deep Learning Applications using MPSoC on Edge Devices
Hua Jiang, Raghav Chakravarthy, Ravikumar V. Chakaravarthy
2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), January 17, 2022. DOI: 10.1109/asp-dac52403.2022.9712581
Citations: 1
Abstract
AI on edge devices [1]–[4] has become increasingly popular over the last few years. Many research projects, such as TVM [5] and TensorFlow Lite [6], have focused on the deployment and acceleration of AI/ML models on edge devices. These solutions have predominantly used data parallelism to accelerate AI/ML models on the edge device, applying techniques such as operator fusion, nested parallelism, and memory latency hiding [5] to achieve the best performance on the supported hardware backends. However, when the hardware offers multiple heterogeneous backends, it becomes important to support task parallelism in addition to data parallelism to achieve optimal performance. Task-level parallelism [7], [8] breaks an AI/ML model down into multiple tasks that can be scheduled across the various heterogeneous backends available in a multi-processor system on chip (MPSoC). In our proposed solution, we take an AI/ML compute graph and break it into a directed acyclic graph (DAG) such that each node of the DAG represents a sub-graph of the original compute graph. The nodes of the DAG are generated using an auto-tuner to achieve optimal performance on the corresponding hardware backend, and each node is compiled into a binary executable for its targeted backend. We are extending our machine learning framework, XTA [9], to generate the DAG. The XTA runtime analyzes the DAG and generates a scheduling configuration: the nodes of the DAG are analyzed for dependencies and parallelized or pipelined accordingly. By parallelizing the execution of nodes in the DAG, we observe a 30% improvement over current solutions. Performance can be further optimized by using more hardware backend cores of the MPSoC to execute the nodes of the DAG in parallel, a capability missing in existing solutions.
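The scheduling idea in the abstract (analyze DAG nodes for dependencies, then run independent nodes in parallel) can be illustrated with a minimal sketch. This is not the authors' XTA runtime; the `dag` structure, `dependency_levels` helper, and thread-pool executor here are hypothetical stand-ins for the paper's dependency analysis and heterogeneous-backend dispatch.

```python
# Hypothetical sketch: partition a DAG of sub-graph tasks into dependency
# levels, then execute each level's independent nodes in parallel workers
# (standing in for the MPSoC's heterogeneous backend cores).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def dependency_levels(dag):
    """dag: {node: [predecessors]} -> list of levels; nodes in a level
    have no mutual dependencies and can run in parallel."""
    indeg = {n: len(preds) for n, preds in dag.items()}
    succs = defaultdict(list)
    for n, preds in dag.items():
        for p in preds:
            succs[p].append(n)
    level = [n for n, d in indeg.items() if d == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for n in level:
            for s in succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    nxt.append(s)
        level = nxt
    return levels


def run(dag, execute, workers=4):
    """Run each node only after its predecessors; each level's nodes are
    submitted to the pool concurrently."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in dependency_levels(dag):
            futures = {n: pool.submit(execute, n) for n in level}
            for n, f in futures.items():
                results[n] = f.result()
    return results


# Example: A feeds B and C (independent, so parallelizable); both feed D.
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(dependency_levels(dag))  # [['A'], ['B', 'C'], ['D']]
```

In a real MPSoC deployment, `execute` would dispatch each node's compiled binary to its tuned backend rather than a host thread, but the level-by-level dependency structure is the same.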