Anqi Guo, Tong Geng, Yongan Zhang, Pouya Haghi, Chunshu Wu, Cheng Tan, Yingyan Lin, Ang Li, Martin C. Herbordt
Title: A Framework for Neural Network Inference on FPGA-Centric SmartNICs
DOI: 10.1109/FPL57034.2022.00071
Published in: 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)
Publication date: 2022-08-01
Citations: 5
Abstract
FPGA-based SmartNICs offer great potential to significantly improve the performance of high-performance computing and warehouse data processing: by tightly coupling support for reconfigurable, data-intensive computation with cross-node communication, they mitigate the von Neumann bottleneck. Existing work, however, has generally been limited in that it assumes an accelerator model in which kernels are offloaded to SmartNICs while most control tasks are left to the CPUs. This leads to frequent waiting, reduced performance, and scaling challenges. In this work we propose FCsN, a new distributed, data-centric computing framework for reconfigurable SmartNIC-based systems. Through a lightweight task-circulation execution model and its implementation architecture, FCsN allows NN kernel execution, control logic, system scheduling, and network communication to be detached entirely to the SmartNICs. This boosts performance by (i) avoiding control dependencies on CPUs and (ii) supporting streaming NN kernel execution and network communication at line rate and at very fine granularity. We demonstrate the efficiency and flexibility of FCsN with various types of neural network kernels and applications, including deep neural networks (DNNs) and graph neural networks (GNNs); because the latter are both irregular and data intensive, they offer an especially robust demonstration. Evaluations using commonly used neural network models and graph datasets show that a system with FCsN can achieve 10× speedups over MPI-based standard CPU baselines.
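The abstract's central idea can be pictured as a task-circulation loop that lives entirely on the SmartNIC: each unit of work flows through NIC-resident stages (receive, NN-kernel compute, send) with no CPU in the control path. The sketch below is purely illustrative and assumes nothing from the paper's actual implementation; the stage names, `nic_task_circulation` function, and queue-based scheduler are hypothetical stand-ins for the hardware pipeline.

```python
from collections import deque

# Illustrative sketch (not the paper's implementation): tasks circulate
# among SmartNIC-resident stages -- recv, compute, send -- and scheduling
# decisions stay on the NIC, so no CPU round-trip is ever needed.

def nic_task_circulation(tasks):
    """Run each task through every NIC stage; return the execution log."""
    stages = ["recv", "compute", "send"]
    queue = deque((t, 0) for t in tasks)     # (task, next-stage index)
    log = []
    while queue:
        task, s = queue.popleft()
        log.append((task, stages[s]))        # stage executed on the NIC
        if s + 1 < len(stages):
            queue.append((task, s + 1))      # circulate to the next stage
    return log
```

In this toy model the queue interleaves stages of different tasks, which loosely mirrors the fine-grained, streaming overlap of computation and communication that the abstract attributes to FCsN; the real framework realizes this in reconfigurable hardware rather than software.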