Scalable and Load-Balanced Full-Graph GNN Training on Multiple GPUs
Authors: Qiange Wang; Yao Chen; Weng-Fai Wong; Bingsheng He
Journal: IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 7, pp. 4239-4253
Published: 2025-04-08
DOI: 10.1109/TKDE.2025.3558641
URL: https://ieeexplore.ieee.org/document/10955266/
While full-graph training is effective for graph learning, it typically demands substantial memory resources. Existing multi-GPU training frameworks struggle with scalability because they require retaining data for each layer within GPU memory. In this work, we present $\mathsf{HongTu}$, a memory-efficient system that supports out-of-memory full-graph GNN training on GPUs. $\mathsf{HongTu}$ offloads vertex data to CPU memory and employs partition-parallel training, which splits large graphs and assigns the partitions to multiple GPUs. To reduce runtime memory consumption with optimal performance, $\mathsf{HongTu}$ uses a hybrid solution combining recomputation, caching, and computation reordering, enabling efficient layer-wise management of intermediate data. To address the increased communication caused by duplicated neighbor access among partitions, $\mathsf{HongTu}$ employs a deduplicated communication framework that converts host-GPU transfers into more efficient inter/intra-GPU data access. Additionally, $\mathsf{HongTu}$ tackles the load-imbalance issues in out-of-memory full-graph training with a multi-objective graph partition algorithm that balances memory consumption and data transfer and maximizes the effectiveness of communication deduplication. Experiments on a 4× A100 GPU server show that $\mathsf{HongTu}$ can effectively train billion-edge graphs while reducing host-GPU data communication by 25% to 71%. Compared to a full-graph GNN system running on 16 CPU nodes, $\mathsf{HongTu}$ achieves speedups ranging from 11.4× to 21.3×.
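As a rough illustration of why duplicated neighbor access inflates host-GPU traffic, the toy sketch below counts vertex-feature loads for a tiny partitioned graph. It is not the HongTu implementation: the function names, the edge-list representation, and the `part_of` mapping are all assumptions made for this example, and it simply assumes that a vertex needed by several partitions can be loaded from host memory once and then shared between GPUs.

```python
from collections import defaultdict


def neighbor_sets(edges, part_of):
    """Map each partition to the set of source vertices its destinations read.

    `edges` is a list of (src, dst) pairs; `part_of` maps a vertex to the
    partition that owns it. Both are hypothetical inputs for this sketch.
    """
    needed = defaultdict(set)
    for src, dst in edges:
        needed[part_of[dst]].add(src)
    return needed


def host_transfer_counts(edges, part_of):
    """Return (naive, dedup) counts of vertex features loaded from the host.

    naive: every partition loads all of its neighbors from host memory.
    dedup: each distinct vertex is loaded from the host only once; further
           uses are assumed to be served by GPU-to-GPU access.
    """
    needed = neighbor_sets(edges, part_of)
    naive = sum(len(vertices) for vertices in needed.values())
    dedup = len(set().union(*needed.values()))
    return naive, dedup


# Toy graph: 4 vertices split into 2 partitions (illustrative data only).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 0)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}

naive, dedup = host_transfer_counts(edges, part_of)
print(f"naive host loads: {naive}, deduplicated host loads: {dedup}")
```

In this toy graph, vertex 0 is read by both partitions, so counting it once removes one host load. How much is saved on a real graph depends on the partitioning, which is consistent with the abstract's point that the partition algorithm also aims to maximize the effectiveness of deduplication.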
About the journal:
The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.