Mohamed Amine Hamdi; Francesco Daghero; Giuseppe Maria Sarda; Josse Van Delm; Arne Symons; Luca Benini; Marian Verhelst; Daniele Jahier Pagliari; Alessio Burrello
{"title":"MATCH:异构边缘设备基于模型感知的tvm编译","authors":"Mohamed Amine Hamdi;Francesco Daghero;Giuseppe Maria Sarda;Josse Van Delm;Arne Symons;Luca Benini;Marian Verhelst;Daniele Jahier Pagliari;Alessio Burrello","doi":"10.1109/TCAD.2025.3556967","DOIUrl":null,"url":null,"abstract":"Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting them to a different one implies labor-intensive redevelopment of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, producing general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework can compete with, and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware cost model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency on average by <inline-formula> <tex-math>$60.87\\times $ </tex-math></inline-formula> on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by <inline-formula> <tex-math>$2.15\\times $ </tex-math></inline-formula> compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 10","pages":"3844-3857"},"PeriodicalIF":2.9000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MATCH: Model-Aware TVM-Based Compilation for Heterogeneous Edge Devices\",\"authors\":\"Mohamed Amine Hamdi;Francesco Daghero;Giuseppe Maria Sarda;Josse Van Delm;Arne Symons;Luca Benini;Marian Verhelst;Daniele Jahier Pagliari;Alessio Burrello\",\"doi\":\"10.1109/TCAD.2025.3556967\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting them to a different one implies labor-intensive redevelopment of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, producing general but unoptimized code. 
To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework can compete with, and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware cost model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency on average by <inline-formula> <tex-math>$60.87\\\\times $ </tex-math></inline-formula> on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by <inline-formula> <tex-math>$2.15\\\\times $ </tex-math></inline-formula> compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board.\",\"PeriodicalId\":13251,\"journal\":{\"name\":\"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems\",\"volume\":\"44 10\",\"pages\":\"3844-3857\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10946988/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10946988/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
MATCH: Model-Aware TVM-Based Compilation for Heterogeneous Edge Devices
Streamlining the deployment of deep neural networks (DNNs) on heterogeneous edge platforms, which couple instruction processors and hardware accelerators for tensor computations within the same microcontroller unit (MCU), is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting them to a different one implies labor-intensive redevelopment of almost the entire compiler. Conversely, retargetable toolchains such as TVM fail to exploit the capabilities of custom accelerators, producing general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy, agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework can compete with, and even outperform, custom toolchains on diverse targets, while only needing the definition of an abstract hardware cost model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite, MATCH reduces inference latency on DIANA by 60.87× on average compared to plain TVM, thanks to its exploitation of the on-board hardware accelerator. Compared to HTVM, a fully customized toolchain for DIANA, MATCH still reduces latency by 16.94%. On GAP9, using the same benchmarks, MATCH improves latency by 2.15× compared to the dedicated DORY compiler, thanks to a heterogeneous DNN mapping approach that synergistically exploits the DNN accelerator and the eight-core cluster available on board.
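The abstract states that retargeting MATCH to a new SoC only requires defining an abstract hardware cost model and a SoC-specific API. As a rough, hypothetical sketch of what such a model-based hardware abstraction could look like (illustrative names and parameters only; this is not MATCH's actual interface), consider the following Python fragment:

    from dataclasses import dataclass

    # Hypothetical sketch, not MATCH's real API: a minimal "model-based
    # hardware abstraction" in the spirit the abstract describes -- an
    # analytical cost model plus a small SoC-specific capability interface.

    @dataclass
    class MemoryLevel:
        name: str
        size_bytes: int
        bandwidth_bytes_per_cycle: float

    class AcceleratorModel:
        def __init__(self, memory_hierarchy, macs_per_cycle):
            self.memory_hierarchy = memory_hierarchy  # innermost level first
            self.macs_per_cycle = macs_per_cycle

        def supports(self, op_name: str) -> bool:
            # SoC-specific API: which operators can be offloaded.
            return op_name in {"conv2d", "dense"}

        def latency_cycles(self, macs: int, bytes_moved: int) -> float:
            # Abstract cost model: a roofline-style estimate, taking the max
            # of compute time and transfer time at the innermost memory level.
            compute = macs / self.macs_per_cycle
            transfer = bytes_moved / self.memory_hierarchy[0].bandwidth_bytes_per_cycle
            return max(compute, transfer)

    # Toy SoC with made-up parameters, used to rank candidate mappings.
    soc = AcceleratorModel(
        memory_hierarchy=[MemoryLevel("L1", 128 * 1024, 8.0),
                          MemoryLevel("L2", 1536 * 1024, 2.0)],
        macs_per_cycle=64,
    )
    if soc.supports("conv2d"):
        print(soc.latency_cycles(macs=1_000_000, bytes_moved=300_000))

A compiler built on such an abstraction can estimate, per layer and per candidate mapping, whether the accelerator or the general-purpose cores are the faster target, without hard-coding any SoC-specific logic into the compiler itself.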
Journal description:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.