2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC): Latest Publications

Low-latency Mini-batch GNN Inference on CPU-FPGA Heterogeneous Platform
Bingyi Zhang, Hanqing Zeng, V. Prasanna
{"title":"Low-latency Mini-batch GNN Inference on CPU-FPGA Heterogeneous Platform","authors":"Bingyi Zhang, Hanqing Zeng, V. Prasanna","doi":"10.1109/HiPC56025.2022.00015","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00015","url":null,"abstract":"Mini-batch inference of Graph Neural Networks (GNNs) is a key problem in many real-world applications. In this paper, we develop a computationally efficient mapping of GNNs onto CPU-FPGA heterogeneous platforms to achieve low-latency mini-batch inference. While the lightweight preprocessing algorithm of GNNs can be efficiently mapped onto the CPU platform, on the FPGA platform, we design a novel GNN hardware accelerator with an adaptive datapath denoted as Adaptive Computation Kernel (ACK) that can execute various computation kernels of GNNs with low-latency: (1) for dense computation kernels expressed as matrix multiplication, ACK works as a systolic array with fully localized connections, (2) for sparse computation kernels, ACK follows the scatter-gather paradigm and works as multiple parallel pipelines to support the irregular connectivity of graphs. The proposed task scheduling hides the CPU-FPGA data communication overhead to reduce the inference latency. We develop a fast design space exploration algorithm to generate a single accelerator for multiple target GNN models. We implement our accelerator on a state-of-the-art CPU-FPGA platform and evaluate the performance using three representative models (GCN, GraphSAGE, GAT). Results show that our CPU-FPGA implementation achieves 21.4−50.8×, 2.9 − 21.6×, 4.7× latency reduction compared with state-of-the-art implementations on CPU-only, CPU-GPU and CPU-FPGA platforms.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133724342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
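To make the ACK's two operating modes concrete, here is a minimal NumPy sketch of the two kernel styles it switches between: a dense feature transformation expressed as matrix multiplication (the systolic-array case) and a sparse neighbor aggregation in the scatter-gather style. The edge-list layout, sizes, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dense_kernel(h, w):
    """Dense GNN kernel (feature transformation) as matrix multiplication;
    this is the case the abstract maps to a systolic array on the FPGA."""
    return h @ w

def sparse_kernel(h, src, dst, num_nodes):
    """Sparse GNN kernel (neighbor aggregation) in scatter-gather style:
    each edge scatters its source node's feature, which is gathered
    (summed here) at the destination node."""
    out = np.zeros((num_nodes, h.shape[1]))
    np.add.at(out, dst, h[src])  # unbuffered scatter-add, one message per edge
    return out

# Toy mini-batch: 4 nodes, 8-dim features, hypothetical edge list (src -> dst)
h = np.random.rand(4, 8)
w = np.random.rand(8, 16)
src = np.array([0, 1, 2, 3, 1])
dst = np.array([1, 0, 1, 2, 2])

h = sparse_kernel(h, src, dst, num_nodes=4)  # aggregate over edges
h = dense_kernel(h, w)                       # transform features
```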
Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens
{"title":"Building a Performance Model for Deep Learning Recommendation Model Training on GPUs","authors":"Zhongyi Lin, Louis Feng, E. K. Ardestani, Jaewon Lee, J. Lundell, Changkyu Kim, A. Kejariwal, John Douglas Owens","doi":"10.1109/HiPC56025.2022.00019","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00019","url":null,"abstract":"We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), whose GPU utilization is low compared to other well-optimized CV and NLP models. We show that both the device active time (the sum of kernel runtimes) but also the device idle time are important components of the overall device time. We therefore tackle them separately by (1) flexibly adopting heuristic-based and ML-based kernel performance models for operators that dominate the device active time, and (2) categorizing operator overheads into five types to determine quantitatively their contribution to the device active time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean average error (GMAE) in all kernel performance modeling, and 4.61% and 7.96% geomean errors for GPU active time and overall E2E per-batch training time prediction with overheads from individual workloads, respectively. A slight increase of 2.19% incurred in E2E prediction error with shared overheads across workloads suggests the feasibility of using shared overheads in large-scale prediction. We show that our general performance model not only achieves low prediction error on DLRM, which has highly customized configurations and is dominated by multiple factors but also yields comparable accuracy on other compute-bound ML models targeted by most previous methods. Using this performance model and graph-level data and task dependency analysis, we show our system can provide more general model-system co-design than previous methods.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"129 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121389519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
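The critical-path idea in this abstract is easy to state in code: given per-operator time estimates and the execution graph's dependencies, the predicted per-batch time is the latest finish time over all operators. The sketch below is a hypothetical illustration; the toy DLRM-like graph, the per-op costs, and the fixed per-operator overhead are all assumed values, not the paper's model.

```python
from functools import lru_cache

# Hypothetical execution graph: op -> (predicted kernel time in ms, dependencies)
exec_graph = {
    "embedding_lookup": (0.40, []),
    "bottom_mlp":       (0.25, []),
    "interaction":      (0.10, ["embedding_lookup", "bottom_mlp"]),
    "top_mlp":          (0.30, ["interaction"]),
}
PER_OP_OVERHEAD = 0.02  # assumed fixed launch/runtime overhead per operator (ms)

@lru_cache(maxsize=None)
def finish_time(op):
    """Earliest finish time of op: it starts when its slowest dependency
    finishes (the critical path) and runs for its cost plus overhead."""
    cost, deps = exec_graph[op]
    start = max((finish_time(d) for d in deps), default=0.0)
    return start + cost + PER_OP_OVERHEAD

per_batch_time = max(finish_time(op) for op in exec_graph)
print(f"predicted per-batch time: {per_batch_time:.2f} ms")
```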
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
Conglong Li, A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He
{"title":"1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed","authors":"Conglong Li, A. Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He","doi":"10.1109/HiPC56025.2022.00044","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00044","url":null,"abstract":"To train large machine learning models (like BERT and GPT-3) on hundreds of GPUs, communication has become a significant bottleneck, especially on commodity systems with limited-bandwidth TCP networks. On one side, large batch-size optimization such as the LAMB algorithm was proposed to reduce the frequency of communication. On the other side, communication compression algorithms such as 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the techniques is not sufficient to solve the communication challenge, especially under low network bandwidth. Motivated by this we aim to combine the power of large-batch optimization and communication compression but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient optimization algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates under compression. In addition to the algorithm and corresponding theoretical analysis, we propose three novel system implementations in order to achieve actual wall clock speedup: a momentum fusion mechanism to reduce the number of communications, a momentum scaling technique to reduce compression error, and a NCCL-based compressed communication backend to improve both usability and performance. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that our optimized implementation of 1-bit LAMB is able to achieve up to 4.6x communication volume reduction, up to 2.8x end-to-end time-wise speedup, and the same sample-wise convergence speed (and same fine-tuning task accuracy) compared to uncompressed LAMB. Furthermore, 1-bit LAMB achieves the same accuracy as LAMB on computer vision tasks like ImageNet and CIFAR100.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114092764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
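As background for how 1-bit compression can preserve convergence, the sketch below shows the standard sign-plus-scale quantization with error feedback that 1-bit Adam/LAMB-style optimizers apply to the momentum before communication. It is a single-worker toy under assumed names; 1-bit LAMB's layerwise-learning-rate support, momentum fusion, and NCCL backend are not modeled here.

```python
import numpy as np

error = None  # residual quantization error carried across steps

def one_bit_compress(momentum):
    """Quantize a tensor to sign bits plus one shared scale, feeding the
    quantization error back into the next call (error compensation) so
    the dropped information is delayed rather than lost."""
    global error
    if error is None:
        error = np.zeros_like(momentum)
    compensated = momentum + error        # re-inject last step's residual
    scale = np.mean(np.abs(compensated))  # one magnitude for the whole tensor
    quantized = scale * np.sign(compensated)
    error = compensated - quantized       # remember what quantization dropped
    return quantized

m = np.random.randn(1000)
m_hat = one_bit_compress(m)  # what each worker would all-reduce, ~1 bit/value
print("relative error:", np.linalg.norm(m - m_hat) / np.linalg.norm(m))
```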