MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers

IF 2.9 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Yue Tang;Alex K. Jones;Jinjun Xiong;Peipei Zhou;Jingtong Hu
DOI: 10.1109/TCAD.2025.3541486
Published: 2025-02-12, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 9, pp. 3395-3408
URL: https://ieeexplore.ieee.org/document/10883660/
Citations: 0

Abstract

FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, etc. Equipped with heterogeneous FPGA-based accelerator cards, these servers can run multiple tasks, including efficient video preprocessing, machine learning algorithm acceleration, etc. Such servers are required to perform inference during the daytime and retrain the model at night to adapt to new environments, domains, or users. Conventionally, during retraining, the incoming data are transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. This process is inefficient and cannot protect users' privacy, so it is desirable for the models to be trained directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous resource-constrained FPGAs is challenging, since it must account for both the complex data dependencies of the training process and the communication bottleneck among different FPGAs. Previous multiaccelerator training algorithms select optimal scheduling strategies from data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). However, PP cannot handle batch normalization (BN), an essential CNN operator, while applying only DP and TP suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multiaccelerator training scheduling strategy that transforms the training process into a multibranch workflow, so that independent suboperations of different branches execute on different training accelerators in parallel for better utilization and reduced communication overhead. Experimental results show that we achieve efficient CNN training on heterogeneous FPGA-based edge servers with a $1.07\times$–$2.21\times$ speedup under 15-GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
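The core idea of the abstract, executing independent suboperations of different branches on different heterogeneous accelerators, can be illustrated with a toy greedy scheduler. This is a hypothetical sketch of the general scheduling concept, not the paper's MTrain algorithm; the sub-operation sizes, accelerator speeds, and the longest-processing-time heuristic are all illustrative assumptions.

```python
# Hypothetical sketch: map independent branch sub-operations onto
# heterogeneous accelerators, assigning each sub-op to the accelerator
# that would finish it earliest. Work amounts and speeds are made up;
# this is NOT the paper's MTrain algorithm.

def schedule(subops, accel_speeds):
    """Return per-accelerator finish times after greedily assigning
    independent sub-ops (given as work amounts) to accelerators with
    different relative speeds."""
    finish = [0.0] * len(accel_speeds)            # current finish time per accelerator
    for work in sorted(subops, reverse=True):     # largest sub-op first (LPT heuristic)
        # pick the accelerator that completes this sub-op soonest
        best = min(range(len(accel_speeds)),
                   key=lambda i: finish[i] + work / accel_speeds[i])
        finish[best] += work / accel_speeds[best]
    return finish

# Two heterogeneous FPGAs (relative speeds 2.0 and 1.0) and four
# independent branch sub-operations.
makespan = max(schedule([8, 6, 4, 2], [2.0, 1.0]))
# Running everything sequentially on the fast accelerator alone would
# take (8+6+4+2)/2.0 = 10.0; parallel execution across both is shorter.
```

The sketch shows only the utilization argument: spreading independent work across unequal accelerators beats serializing it on the fastest one, which is the intuition behind a multibranch workflow.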
Source journal
CiteScore: 5.60
Self-citation rate: 13.80%
Articles per year: 500
Review time: 7 months
Journal introduction: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.