MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers
Yue Tang; Alex K. Jones; Jinjun Xiong; Peipei Zhou; Jingtong Hu
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 9, pp. 3395-3408
DOI: 10.1109/TCAD.2025.3541486
Published: 2025-02-12
URL: https://ieeexplore.ieee.org/document/10883660/
Citations: 0
Abstract
FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, and beyond. Equipped with heterogeneous FPGA-based accelerator cards, these servers can run multiple tasks, including efficient video preprocessing and machine learning algorithm acceleration. Such servers are required to perform inference during the daytime and retrain the model at night to adapt to new environments, domains, or users. Conventionally, during retraining, the incoming data are transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. This process is inefficient and cannot protect users' privacy, so it is desirable to train the models directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous, resource-constrained FPGAs is challenging because it must account for both the complex data dependencies of the training process and the communication bottleneck among different FPGAs. Previous multiaccelerator training algorithms select optimal scheduling strategies from data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP). However, PP cannot handle batch normalization (BN), an essential CNN operator, while applying only DP and TP suffers from resource underutilization and intensive communication costs. In this work, we propose MTrain, a novel multiaccelerator training scheduling strategy that transforms the training process into a multibranch workflow, so that independent sub-operations of different branches are executed on different training accelerators in parallel for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with a $1.07\times$–$2.21\times$ speedup under 15-GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
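To make the multibranch idea concrete, the following is a minimal, hypothetical sketch of greedy list scheduling: independent branch sub-operations are assigned to heterogeneous accelerators, each branch going to the card that would finish it earliest. The branch costs and accelerator speed factors are invented for illustration; this is not MTrain's actual scheduling algorithm, only a toy model of why spreading independent branches across unequal accelerators improves utilization.

```python
# Illustrative sketch only: greedy earliest-finish scheduling of independent
# branch sub-operations onto heterogeneous accelerators. All numbers below
# are hypothetical; MTrain's real strategy is described in the paper itself.
import heapq


def schedule_branches(branch_costs, accel_speeds):
    """Assign each independent branch to the accelerator that finishes it
    earliest. Returns per-accelerator finish times (makespan is the max).
    branch_costs: work units per branch; accel_speeds: relative throughput."""
    # Min-heap of (current finish time, accelerator index).
    heap = [(0.0, i) for i in range(len(accel_speeds))]
    heapq.heapify(heap)
    finish = [0.0] * len(accel_speeds)
    # Place the largest branches first -- a common load-balancing heuristic.
    for cost in sorted(branch_costs, reverse=True):
        t, i = heapq.heappop(heap)
        t += cost / accel_speeds[i]
        finish[i] = t
        heapq.heappush(heap, (t, i))
    return finish


# Three hypothetical branches scheduled on two accelerators of unequal speed.
finish_times = schedule_branches([4.0, 3.0, 2.0], [2.0, 1.0])
makespan = max(finish_times)
# Baseline: running all branches sequentially on the fastest card alone.
sequential = sum(c / 2.0 for c in [4.0, 3.0, 2.0])
```

In this toy example the two cards finish in balance, so the makespan beats running everything on the fast card alone; the real problem is harder because branches also exchange data, which is exactly the communication overhead MTrain aims to reduce.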
Journal Introduction:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical, and logical design, including planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design, and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.