{"title":"Low-latency Mini-batch GNN Inference on CPU-FPGA Heterogeneous Platform","authors":"Bingyi Zhang, Hanqing Zeng, V. Prasanna","doi":"10.1109/HiPC56025.2022.00015","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00015","url":null,"abstract":"Mini-batch inference of Graph Neural Networks (GNNs) is a key problem in many real-world applications. In this paper, we develop a computationally efficient mapping of GNNs onto CPU-FPGA heterogeneous platforms to achieve low-latency mini-batch inference. While the lightweight preprocessing algorithm of GNNs can be efficiently mapped onto the CPU platform, on the FPGA platform, we design a novel GNN hardware accelerator with an adaptive datapath denoted as Adaptive Computation Kernel (ACK) that can execute various computation kernels of GNNs with low-latency: (1) for dense computation kernels expressed as matrix multiplication, ACK works as a systolic array with fully localized connections, (2) for sparse computation kernels, ACK follows the scatter-gather paradigm and works as multiple parallel pipelines to support the irregular connectivity of graphs. The proposed task scheduling hides the CPU-FPGA data communication overhead to reduce the inference latency. We develop a fast design space exploration algorithm to generate a single accelerator for multiple target GNN models. We implement our accelerator on a state-of-the-art CPU-FPGA platform and evaluate the performance using three representative models (GCN, GraphSAGE, GAT). Results show that our CPU-FPGA implementation achieves 21.4−50.8×, 2.9 − 21.6×, 4.7× latency reduction compared with state-of-the-art implementations on CPU-only, CPU-GPU and CPU-FPGA platforms.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133724342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Authors: Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens
DOI: 10.1109/HiPC56025.2022.00019 (https://doi.org/10.1109/HiPC56025.2022.00019)
Venue: 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)
Abstract: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) and the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for the operators that dominate the device active time, and (2) categorizing operator overheads into five types to quantitatively determine their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm that predicts the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geometric mean errors for GPU active time and overall end-to-end (E2E) per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% in E2E prediction error with overheads shared across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors, but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show that our system can provide more general model-system co-design than previous methods.
Title: 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed
Authors: Conglong Li, A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He
DOI: 10.1109/HiPC56025.2022.00044 (https://doi.org/10.1109/HiPC56025.2022.00044)
Venue: 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)
Abstract: To train large machine learning models (such as BERT and GPT-3) on hundreds of GPUs, communication has become a significant bottleneck, especially on commodity systems with limited-bandwidth TCP networks. On one side, large-batch optimization algorithms such as LAMB were proposed to reduce the frequency of communication. On the other side, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that using only one of these techniques is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient optimization algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition to the algorithm and its theoretical analysis, we propose three novel system implementations to achieve actual wall-clock speedup: a momentum fusion mechanism to reduce the number of communications, a momentum scaling technique to reduce compression error, and an NCCL-based compressed communication backend to improve both usability and performance. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that our optimized implementation of 1-bit LAMB achieves up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and the same fine-tuning task accuracy) compared to uncompressed LAMB. Furthermore, 1-bit LAMB achieves the same accuracy as LAMB on computer vision tasks such as ImageNet and CIFAR100.