MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers
Yue Tang; Alex K. Jones; Jinjun Xiong; Peipei Zhou; Jingtong Hu
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 9, pp. 3395-3408
DOI: 10.1109/TCAD.2025.3541486
Published: 2025-02-12
URL: https://ieeexplore.ieee.org/document/10883660/
Citations: 0
Abstract
FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, and beyond. Equipped with heterogeneous FPGA-based accelerator cards, these servers can run multiple tasks, including efficient video preprocessing and machine learning algorithm acceleration. Such servers are required to perform inference during the daytime and retrain the model at night to adapt to new environments, domains, or users. Conventionally, during retraining, the incoming data are transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. This process is inefficient and cannot protect users' privacy, so it is desirable to train the models directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous, resource-constrained FPGAs is challenging because it must account for both the complex data dependencies of the training process and the communication bottleneck among different FPGAs. Previous multiaccelerator training algorithms select optimal scheduling strategies from data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). However, PP cannot handle batch normalization (BN), an essential CNN operator, while applying only DP and TP suffers from resource underutilization and intensive communication costs. In this work, we propose MTrain, a novel multiaccelerator training scheduling strategy that transforms the training process into a multibranch workflow, so that independent sub-operations of different branches are executed on different training accelerators in parallel for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with a $1.07\times$–$2.21\times$ speedup under 15-GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
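To make the multibranch idea concrete, the following is a minimal, hypothetical sketch of greedy list scheduling: independent branch sub-operations are assigned to heterogeneous accelerators, each branch going to the card that would finish it earliest. The branch costs and accelerator speed factors are invented for illustration; this is not MTrain's actual scheduling algorithm, only a toy model of why spreading independent branches across unequal accelerators improves utilization.

```python
# Illustrative sketch only: greedy earliest-finish scheduling of independent
# branch sub-operations onto heterogeneous accelerators. All numbers below
# are hypothetical; MTrain's real strategy is described in the paper itself.
import heapq


def schedule_branches(branch_costs, accel_speeds):
    """Assign each independent branch to the accelerator that finishes it
    earliest. Returns per-accelerator finish times (makespan is the max).
    branch_costs: work units per branch; accel_speeds: relative throughput."""
    # Min-heap of (current finish time, accelerator index).
    heap = [(0.0, i) for i in range(len(accel_speeds))]
    heapq.heapify(heap)
    finish = [0.0] * len(accel_speeds)
    # Place the largest branches first -- a common load-balancing heuristic.
    for cost in sorted(branch_costs, reverse=True):
        t, i = heapq.heappop(heap)
        t += cost / accel_speeds[i]
        finish[i] = t
        heapq.heappush(heap, (t, i))
    return finish


# Three hypothetical branches scheduled on two accelerators of unequal speed.
finish_times = schedule_branches([4.0, 3.0, 2.0], [2.0, 1.0])
makespan = max(finish_times)
# Baseline: running all branches sequentially on the fastest card alone.
sequential = sum(c / 2.0 for c in [4.0, 3.0, 2.0])
```

In this toy example the two cards finish in balance, so the makespan beats running everything on the fast card alone; the real problem is harder because branches also exchange data, which is exactly the communication overhead MTrain aims to reduce.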
Journal Introduction:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical, and logical design, including planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design, and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.