FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification

Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang
The VLDB Journal, published online 2024-07-22. DOI: 10.1007/s00778-024-00870-z

Abstract

Active learning for graph neural networks (GNNs) aims to select B nodes to label so as to achieve the best possible GNN performance. Carefully selected labeled nodes can improve GNN performance and have hence motivated a line of research. Unfortunately, existing methods still yield inferior GNN performance or cannot scale to large networks. Motivated by these limitations, in this paper we present FICOM, an effective and scalable GNN active learning framework. First, we formulate node selection as an optimization problem that considers (i) the importance of a node during feature propagation, with a connection to personalized PageRank (PPR), and (ii) the diversity a node brings to the embedding space generated by feature propagation. We show that the defined problem is submodular, so a greedy algorithm provides a \((1-1/e)\)-approximate solution. However, a standard greedy solution must find the node with the maximum marginal gain of the objective score in each iteration, which incurs a prohibitive running cost and cannot scale to large datasets. As our main contribution, we present FICOM, an efficient and scalable solution that provides a \((1-1/e)\)-approximation guarantee and scales to graphs with millions of nodes on a single machine. The main idea is to adaptively maintain lower and upper bounds on the marginal gain of each node v. In each iteration, we first derive a small subset of candidate nodes and then compute exact scores only for this subset, so that the node with the maximum marginal gain can be found efficiently. Extensive experiments on six benchmark datasets using four GNNs, including GCN, SGC, APPNP, and GCNII, show that FICOM consistently outperforms existing active learning approaches on semi-supervised node classification tasks. Moreover, our solution finishes within 5 h on a million-node graph.
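The bound-maintenance idea in the abstract follows the general pattern of lazy greedy selection for monotone submodular objectives: because marginal gains only shrink as the selection grows, a previously computed gain remains a valid upper bound, so most nodes never need to be re-evaluated in a given iteration. The sketch below illustrates this pattern only (CELF-style lazy greedy); it is not the authors' FICOM implementation, and `gain` stands in for whatever PPR-based importance-plus-diversity objective the paper defines.

```python
import heapq

def lazy_greedy(nodes, gain, budget):
    """Lazy (CELF-style) greedy for a monotone submodular objective.

    gain(v, selected) returns the marginal gain of adding v to the
    current selection. Submodularity means cached gains from earlier
    rounds are valid upper bounds, so stale entries are only
    recomputed when they reach the top of the heap.
    """
    selected = []
    # Max-heap of (-cached_gain, node, round_when_cached).
    heap = [(-gain(v, selected), v, 0) for v in nodes]
    heapq.heapify(heap)
    rnd = 0
    while heap and len(selected) < budget:
        neg_g, v, when = heapq.heappop(heap)
        if when == rnd:
            # Bound is fresh: v truly maximizes the marginal gain.
            selected.append(v)
            rnd += 1
        else:
            # Stale upper bound: recompute once and push back.
            heapq.heappush(heap, (-gain(v, selected), v, rnd))
    return selected

# Toy coverage objective (submodular): each node covers a set of items.
cover = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5}}

def coverage_gain(v, selected):
    covered = set().union(*(cover[u] for u in selected)) if selected else set()
    return len(cover[v] - covered)

picked = lazy_greedy([1, 2, 3], coverage_gain, budget=2)
print(picked)  # node 1 first (gain 3), then node 3 (gain 2)
```

The exact-score-on-a-small-candidate-subset step in FICOM plays the same role as the recompute branch here: only nodes whose upper bound could still be maximal are re-scored.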

