RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network

Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue

ACM Transactions on Architecture and Code Optimization · Published 2023-08-30 · DOI: 10.1145/3617685
Citations: 0
Abstract
Dynamic Graph Neural Networks (DGNNs) have recently attracted significant research attention from various domains, because most real-world graphs are inherently dynamic. Despite many research efforts, existing hardware/software solutions for DGNNs still suffer from substantial redundant computation and memory access overhead, because they irregularly access and recompute all graph data of every graph snapshot. To address these issues, we propose RACE, an efficient redundancy-aware accelerator that enables energy-efficient execution of DGNN models. Specifically, we integrate a redundancy-aware incremental execution approach into the accelerator design, which obtains the output features of the latest graph snapshot by correctly and incrementally refining the output features of the previous snapshot, while also enabling regular accesses to vertices' input features. By traversing the graph on the fly, RACE identifies the vertices that are not affected by graph updates between successive snapshots and reuses these vertices' states (i.e., their output features) from the previous snapshot when processing the latest one. The vertices affected by graph updates are also tracked, and their new states are incrementally recomputed from their neighbors' input features in the latest snapshot to preserve correctness. In this way, the processing and accessing of the large volume of graph data unaffected by graph updates can be safely eliminated, reducing redundant computation and memory access overhead. In addition, the input features that are accessed more frequently are dynamically identified according to the graph topology and preferentially kept resident in on-chip memory, reducing off-chip communication. Experimental results show that RACE achieves average speedups of 1139x and 84.7x for DGNN inference, with average energy savings of 2242x and 234.2x, compared with state-of-the-art software DGNN implementations running on an Intel Xeon CPU and an NVIDIA A100 GPU, respectively. Moreover, for DGNN inference, RACE obtains average speedups of 13.1x, 11.7x, 10.4x, and 7.9x, and average energy savings of 14.8x, 12.9x, 11.5x, and 8.9x, over the state-of-the-art GNN accelerators AWB-GCN, GCNAX, ReGNN, and I-GCN, respectively.
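To make the incremental execution idea concrete, below is a minimal, illustrative Python sketch, not the RACE hardware pipeline itself. It assumes a single GNN layer with mean aggregation and a ReLU activation, and the names (`adj_new`, `edge_updates`, `h_prev`, `incremental_gnn_layer`) are hypothetical. It shows the core principle the abstract describes: only vertices touched by edge insertions or deletions are recomputed, while all other output features are reused from the cached results of the previous snapshot.

```python
import numpy as np

def incremental_gnn_layer(adj_new, edge_updates, x_new, h_prev, weight):
    """One GNN layer over the latest snapshot, reusing cached results.

    adj_new      : dict mapping each vertex to its in-neighbors in the
                   latest snapshot (adjacency list).
    edge_updates : iterable of (src, dst) edges inserted or deleted
                   between the two snapshots.
    x_new        : (V, F_in) input features of the latest snapshot.
    h_prev       : (V, F_out) cached output features of the previous
                   snapshot.
    weight       : (F_in, F_out) layer weight matrix.
    """
    # Vertices whose aggregation result changes: every destination of an
    # inserted/deleted edge. (This covers one layer; a deeper model would
    # expand this frontier by one hop per layer.)
    affected = {dst for _, dst in edge_updates}

    # Unaffected vertices simply reuse their previous output features.
    h_new = h_prev.copy()

    # Affected vertices are incrementally recomputed from their
    # neighbors' input features in the latest snapshot.
    for v in affected:
        neighbors = list(adj_new[v])
        if neighbors:
            agg = x_new[neighbors].mean(axis=0)   # mean aggregation
        else:
            agg = np.zeros(x_new.shape[1])        # isolated vertex
        h_new[v] = np.maximum(agg @ weight, 0.0)  # ReLU(agg · W)
    return h_new
```

In this sketch, if only a handful of edges change between snapshots, the loop touches only those vertices' rows, so both the recomputation and the feature accesses scale with the size of the update rather than the size of the graph, which is the source of the savings the paper reports.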
Journal Introduction:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.