{"title":"Inference framework supporting parallel execution across heterogeneous accelerators","authors":"Philkyue Shin , Myungsun Kim , Seongsoo Hong","doi":"10.1016/j.sysarc.2025.103508","DOIUrl":null,"url":null,"abstract":"<div><div>The growing demand for on-device deep learning inference, particularly in latency-sensitive applications, has driven the adoption of heterogeneous accelerators that incorporate GPUs, DSPs, and NPUs. While these accelerators offer improved energy efficiency, their heterogeneity introduces significant programming complexity due to reliance on vendor-specific APIs. Existing deep learning inference frameworks, such as LiteRT, provide high-level APIs and support multiple backend APIs. However, they lack the ability to exploit parallel execution across heterogeneous accelerators. This paper introduces a novel inference framework that overcomes this limitation. Our framework utilizes a batch inference API to enable parallel execution across heterogeneous accelerators. The framework schedules heterogeneous accelerators to process batched inputs concurrently. To address the challenge of integrating commercial NPU APIs that do not support LiteRT, we develop a portable hooking engine. Furthermore, the framework mitigates accuracy inconsistencies arising from diverse quantization methods by dynamically adjusting postprocessing parameters to balance accuracy and latency. The proposed framework minimizes both average turnaround time and postprocessing time. 
Experimental results on a Qualcomm SA8195 SoC with Mobilint and Hailo NPUs demonstrate significant performance improvements compared to existing inference frameworks.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103508"},"PeriodicalIF":3.7000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125001808","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
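The abstract describes scheduling heterogeneous accelerators to process batched inputs concurrently via a batch inference API. The sketch below illustrates one plausible shape of such a scheme — partitioning a batch proportionally to each accelerator's estimated throughput and dispatching the sub-batches in parallel. All names (`split_batch`, `run_parallel`, the throughput estimates) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch: throughput-proportional batch splitting with
# concurrent dispatch across accelerators. Illustrative only; the
# paper's framework and scheduling policy are not reproduced here.
from concurrent.futures import ThreadPoolExecutor


def split_batch(batch, throughputs):
    """Partition `batch` proportionally to each accelerator's throughput.

    `throughputs` maps accelerator name -> relative inferences/sec.
    The last accelerator absorbs any rounding remainder.
    """
    total = sum(throughputs.values())
    items = list(throughputs.items())
    shares, start = {}, 0
    for i, (name, tp) in enumerate(items):
        if i == len(items) - 1:
            n = len(batch) - start  # remainder goes to the last device
        else:
            n = round(len(batch) * tp / total)
        shares[name] = batch[start:start + n]
        start += n
    return shares


def run_parallel(batch, accelerators, throughputs):
    """Run sub-batches on all accelerators concurrently, merging in order.

    `accelerators` maps accelerator name -> callable(sub_batch) -> results.
    """
    shares = split_batch(batch, throughputs)
    with ThreadPoolExecutor(max_workers=len(accelerators)) as pool:
        futures = {name: pool.submit(fn, shares[name])
                   for name, fn in accelerators.items()}
        # Concatenate results in accelerator order to restore input order.
        return [y for name in accelerators for y in futures[name].result()]


# Toy usage: two "accelerators" modeled as plain functions.
accs = {"gpu": lambda xs: [x * 2 for x in xs],
        "npu": lambda xs: [x * 2 for x in xs]}
outputs = run_parallel(list(range(10)), accs, {"gpu": 3.0, "npu": 1.0})
```

In this toy run, the faster "gpu" receives roughly three quarters of the batch, both sub-batches execute concurrently, and the merged output preserves the original input order.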
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.