{"title":"Inference framework supporting parallel execution across heterogeneous accelerators","authors":"Philkyue Shin , Myungsun Kim , Seongsoo Hong","doi":"10.1016/j.sysarc.2025.103508","DOIUrl":null,"url":null,"abstract":"<div><div>The growing demand for on-device deep learning inference, particularly in latency-sensitive applications, has driven the adoption of heterogeneous accelerators that incorporate GPUs, DSPs, and NPUs. While these accelerators offer improved energy efficiency, their heterogeneity introduces significant programming complexity due to reliance on vendor-specific APIs. Existing deep learning inference frameworks, such as LiteRT, provide high-level APIs and support multiple backend APIs. However, they lack the ability to exploit parallel execution across heterogeneous accelerators. This paper introduces a novel inference framework that overcomes this limitation. Our framework utilizes a batch inference API to enable parallel execution across heterogeneous accelerators. The framework schedules heterogeneous accelerators to process batched inputs concurrently. To address the challenge of integrating commercial NPU APIs that do not support LiteRT, we develop a portable hooking engine. Furthermore, the framework mitigates accuracy inconsistencies arising from diverse quantization methods by dynamically adjusting postprocessing parameters to balance accuracy and latency. The proposed framework minimizes both average turnaround time and postprocessing time. 
Experimental results on a Qualcomm SA8195 SoC with Mobilint and Hailo NPUs demonstrate significant performance improvements compared to existing inference frameworks.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"167 ","pages":"Article 103508"},"PeriodicalIF":3.7000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125001808","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0
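The abstract describes scheduling heterogeneous accelerators to process batched inputs concurrently via a batch inference API. The sketch below illustrates one plausible shape of such a scheme — partitioning a batch proportionally to each accelerator's estimated throughput and dispatching the sub-batches in parallel. All names (`split_batch`, `run_parallel`, the throughput estimates) are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch: throughput-proportional batch splitting with
# concurrent dispatch across accelerators. Illustrative only; the
# paper's framework and scheduling policy are not reproduced here.
from concurrent.futures import ThreadPoolExecutor


def split_batch(batch, throughputs):
    """Partition `batch` proportionally to each accelerator's throughput.

    `throughputs` maps accelerator name -> relative inferences/sec.
    The last accelerator absorbs any rounding remainder.
    """
    total = sum(throughputs.values())
    items = list(throughputs.items())
    shares, start = {}, 0
    for i, (name, tp) in enumerate(items):
        if i == len(items) - 1:
            n = len(batch) - start  # remainder goes to the last device
        else:
            n = round(len(batch) * tp / total)
        shares[name] = batch[start:start + n]
        start += n
    return shares


def run_parallel(batch, accelerators, throughputs):
    """Run sub-batches on all accelerators concurrently, merging in order.

    `accelerators` maps accelerator name -> callable(sub_batch) -> results.
    """
    shares = split_batch(batch, throughputs)
    with ThreadPoolExecutor(max_workers=len(accelerators)) as pool:
        futures = {name: pool.submit(fn, shares[name])
                   for name, fn in accelerators.items()}
        # Concatenate results in accelerator order to restore input order.
        return [y for name in accelerators for y in futures[name].result()]


# Toy usage: two "accelerators" modeled as plain functions.
accs = {"gpu": lambda xs: [x * 2 for x in xs],
        "npu": lambda xs: [x * 2 for x in xs]}
outputs = run_parallel(list(range(10)), accs, {"gpu": 3.0, "npu": 1.0})
```

In this toy run, the faster "gpu" receives roughly three quarters of the batch, both sub-batches execute concurrently, and the merged output preserves the original input order.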
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.