Scalable and Load-Balanced Full-Graph GNN Training on Multiple GPUs

Impact factor: 10.4 | Zone 2 (Computer Science) | JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

IEEE Transactions on Knowledge and Data Engineering | Pub Date: 2025-04-08 | DOI: 10.1109/TKDE.2025.3558641

Qiange Wang;Yao Chen;Weng-Fai Wong;Bingsheng He

Volume 37, Issue 7, pp. 4239-4253. Full text: https://ieeexplore.ieee.org/document/10955266/

Citations: 0

Abstract


Scalable and Load-Balanced Full-Graph GNN Training on Multiple GPUs

While full-graph training is effective for graph learning, it typically demands substantial memory resources. Existing multi-GPU training frameworks struggle with scalability because they require retaining data for each layer within GPU memory. In this work, we present $\mathsf{HongTu}$, a memory-efficient system that supports out-of-memory full-graph GNN training on GPUs. $\mathsf{HongTu}$ offloads vertex data to CPU memory and employs partition-parallel training that splits large graphs and assigns the partitions to multiple GPUs. To reduce runtime memory consumption with optimal performance, $\mathsf{HongTu}$ utilizes a hybrid solution combining recomputation, caching, and computation reordering, enabling efficient layer-wise management of intermediate data. To address the increased communication caused by duplicated neighbor access among partitions, $\mathsf{HongTu}$ employs a deduplicated communication framework that converts host-GPU transfers into more efficient inter/intra-GPU data access. Additionally, $\mathsf{HongTu}$ tackles the load-imbalance issues in out-of-memory full-graph training, featuring a multi-objective graph partition algorithm that balances memory consumption and data transfer and maximizes the effectiveness of communication deduplication. Experiments on a 4× A100 GPU server show that $\mathsf{HongTu}$ can effectively train billion-edge graphs while reducing host-GPU data communication by 25% to 71%. Compared to a full-graph GNN system running on 16 CPU nodes, $\mathsf{HongTu}$ achieves speedups ranging from 11.4× to 21.3×.
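The idea behind communication deduplication can be illustrated with a toy count. The sketch below is not HongTu's implementation. It assumes a simplified cost model in which every remote vertex a partition reads costs one host-GPU transfer, and in which a vertex already loaded on any GPU can be read by the other partitions without going back to the host. The `needed` mapping and both function names are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Set

# Hypothetical toy input (not data from the paper): for each GPU partition,
# the IDs of the neighbor vertices whose features it must read.
needed: Dict[int, Set[int]] = {
    0: {1, 2, 3, 4},
    1: {3, 4, 5},
    2: {4, 5, 6},
    3: {1, 6, 7},
}


def naive_host_transfers(needed: Dict[int, Set[int]]) -> int:
    """Every partition fetches each vertex it needs from host memory."""
    return sum(len(ids) for ids in needed.values())


def dedup_host_transfers(needed: Dict[int, Set[int]]) -> int:
    """Each distinct vertex is fetched from the host once (assumed model);
    other partitions reuse the copy that already sits on a GPU."""
    return len(set().union(*needed.values()))


naive = naive_host_transfers(needed)
dedup = dedup_host_transfers(needed)
reduction = 1 - dedup / naive
print(f"naive={naive}, deduplicated={dedup}, reduction={reduction:.0%}")
```

In this toy input, vertices shared by several partitions are counted once instead of once per partition, so the host-GPU transfer count drops. The actual saving in a real system depends on how the graph is partitioned, which is why the abstract ties the partitioning algorithm to the effectiveness of deduplication.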


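Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 11.70
Self-citation rate: 3.40%
Articles published: 515
Review time: 6 months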

About the journal: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.