Efficient Learning for Billion-Scale Heterogeneous Information Networks
Ruize Shi; Hong Huang; Xue Lin; Kehan Yin; Wei Zhou; Hai Jin
IEEE Transactions on Big Data, vol. 11, no. 2, pp. 748-760. Published: 2024-07-15. DOI: 10.1109/TBDATA.2024.3428331
https://ieeexplore.ieee.org/document/10598347/
Citations: 0
Abstract
Heterogeneous graph neural networks (HGNNs) excel at understanding heterogeneous information networks (HINs) and have demonstrated state-of-the-art performance across numerous tasks. However, prior work has largely studied small datasets that deviate significantly from real-world scenarios. In particular, heterogeneous message passing incurs substantial memory and time overheads, since it aggregates heterogeneous neighbor features multiple times. To address this, we propose an Efficient Heterogeneous Graph Neural Network (EHGNN) that leverages heterogeneous personalized PageRank (HPPR) to preserve the influence between all nodes, then approximates message passing and selectively loads neighbor information for a single aggregation, significantly reducing memory and time usage. In addition, we employ several lightweight techniques to preserve the performance of EHGNN. Evaluations on various HIN benchmarks for node classification and link prediction unequivocally establish the superiority of EHGNN, which surpasses the state of the art by 11% in performance. EHGNN also achieves a remarkable 400% boost in training and inference speed while using less memory. Notably, EHGNN can process a 200-million-node, 1-billion-link HIN within 18 hours on a single machine using only 170 GB of memory, far below the previous minimum requirement of 600 GB.
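The core idea, as the abstract describes it, is to replace repeated heterogeneous message passing with a precomputed influence measure (HPPR), so that each node's representation can be built from one weighted aggregation over its most influential neighbors. The sketch below illustrates this pattern with a plain forward-push personalized-PageRank approximation on a homogeneous graph; the function names, the `alpha`/`eps`/`k` parameters, and the omission of node- and edge-type weighting are all simplifying assumptions, not the paper's actual HPPR algorithm.

```python
# Minimal sketch (not the paper's implementation): forward-push personalized
# PageRank pre-selects each node's most influential neighbors, after which a
# single weighted aggregation replaces repeated message passing. Assumes a
# homogeneous CSR adjacency; EHGNN's HPPR additionally accounts for node and
# edge types, which is omitted here.
import numpy as np
import scipy.sparse as sp

def approx_ppr_topk(adj: sp.csr_matrix, source: int, alpha: float = 0.15,
                    eps: float = 1e-4, k: int = 32):
    """Approximate PPR from `source` via forward push; return top-k nodes."""
    n = adj.shape[0]
    p = np.zeros(n)                         # PPR estimates
    r = np.zeros(n)                         # residual probability mass
    r[source] = 1.0
    deg = np.asarray(adj.sum(axis=1)).ravel()
    frontier = [source]
    while frontier:
        u = frontier.pop()
        if r[u] < eps * max(deg[u], 1.0):   # residual too small to push
            continue
        push, r[u] = r[u], 0.0
        p[u] += alpha * push                # keep an alpha fraction locally
        if deg[u] == 0:
            continue
        spread = (1.0 - alpha) * push / deg[u]
        for v in adj.indices[adj.indptr[u]:adj.indptr[u + 1]]:
            r[v] += spread                  # push the rest to neighbors
            if r[v] >= eps * max(deg[v], 1.0):
                frontier.append(v)
    topk = np.argsort(-p)[:k]
    return topk, p[topk]

def aggregate_once(features: np.ndarray, topk: np.ndarray,
                   scores: np.ndarray) -> np.ndarray:
    """One PPR-weighted aggregation over the pre-selected neighbors."""
    w = scores / scores.sum()
    return w @ features[topk]               # (k,) @ (k, d) -> (d,)
```

Because the influence scores are precomputed, only the top-k feature rows need to be loaded per node, and they are aggregated exactly once rather than re-aggregated at every layer, which is where the memory and speed savings the abstract reports come from.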
About the Journal
The IEEE Transactions on Big Data publishes peer-reviewed articles on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas span big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields that generate massive datasets.