(p,q)-biclique counting and enumeration for large sparse bipartite graphs.

IF 2.8; Zone 2 (Computer Science); JCR Q2 (Computer Science, Hardware & Architecture)

VLDB Journal. Pub Date: 2023-03-13. DOI: 10.1007/s00778-023-00786-0

Jianye Yang, Yun Peng, Dian Ouyang, Wenjie Zhang, Xuemin Lin, Xiang Zhao


Citations: 0

Abstract

In this paper, we study the problem of $(p, q)$-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph $G = (U, V, E)$ and two integer parameters $p$ and $q$, we aim to efficiently count and enumerate all $(p, q)$-bicliques in $G$, where a $(p, q)$-biclique $B(L, R)$ is a complete bipartite subgraph of $G$ with $L \subseteq U$, $R \subseteq V$, $|L| = p$, and $|R| = q$. The problem of $(p, q)$-biclique counting and enumeration has many applications, such as graph neural network information aggregation, densest subgraph detection, and cohesive subgroup analysis. Despite this wide range of applications, to the best of our knowledge, there is no efficient and scalable solution to this problem in the literature. The problem is computationally challenging because the number of $(p, q)$-bicliques can be exponential in the worst case. We first propose a competitive branch-and-bound baseline method, BCList, which explores the search space in a depth-first manner and applies a variety of pruning techniques. Although BCList offers a useful computation framework for our problem, its worst-case time complexity is exponential in $p + q$. To alleviate this, we propose an advanced approach called BCList++. In particular, BCList++ applies a layer-based exploration strategy that enumerates $(p, q)$-bicliques by anchoring the search on either $U$ or $V$ only, so its worst-case time complexity is exponential in either $p$ or $q$ only. Consequently, a vital task is to choose the layer with the least computation cost. To this end, we develop a cost model built upon an unbiased estimator for the density of the 2-hop graph induced by $U$ or $V$. To improve computation efficiency, BCList++ exploits pre-allocated arrays and vertex labeling techniques so that frequent subgraph creation operations can be replaced by array element switching operations.
We conduct extensive experiments on 16 real-life datasets, and the results demonstrate that BCList++ outperforms the baseline methods by up to 3 orders of magnitude. We show via a case study that $(p, q)$-bicliques can be used to optimize the efficiency of graph neural networks. We further extend our techniques to count and enumerate $(p, q)$-bicliques on uncertain bipartite graphs. An efficient method, IUBCList, is developed on top of BCList++, together with a couple of pruning techniques, including common neighbor refinement and search branch early termination, to discard unpromising uncertain $(p, q)$-bicliques early. The experimental results demonstrate that IUBCList outperforms the baseline method by up to 2 orders of magnitude.
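
To make the definition concrete, the following is a minimal brute-force sketch, not the authors' BCList or BCList++. It anchors on the $U$ side only: for every $p$-subset $L \subseteq U$ it computes the common neighborhood $N(L) \subseteq V$, so the count equals $\sum_{L \subseteq U, |L| = p} \binom{|N(L)|}{q}$. This illustrates why anchoring on one layer keeps the combinatorial search over one parameter only, but it includes none of the pruning, cost model, or array techniques described in the abstract. The function names and the adjacency-dictionary input format are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def common_neighbors(adj_u, left):
    """Return the vertices of V adjacent to every vertex in `left`."""
    return set.intersection(*(set(adj_u[u]) for u in left))


def count_pq_bicliques(adj_u, p, q):
    """Count (p, q)-bicliques by anchoring the search on the U side.

    adj_u (hypothetical input format) maps each vertex of U to the
    collection of its neighbors in V. Requires p >= 1 and q >= 1.
    """
    if p < 1 or q < 1:
        raise ValueError("p and q must be positive integers")
    total = 0
    for left in combinations(adj_u, p):
        # Any q-subset of the common neighborhood completes a biclique.
        total += comb(len(common_neighbors(adj_u, left)), q)
    return total


def enumerate_pq_bicliques(adj_u, p, q):
    """Yield every (p, q)-biclique as a pair of tuples (L, R)."""
    if p < 1 or q < 1:
        raise ValueError("p and q must be positive integers")
    for left in combinations(adj_u, p):
        common = sorted(common_neighbors(adj_u, left))
        for right in combinations(common, q):
            yield left, right


# Toy bipartite graph: U = {u1, u2, u3}, V = {v1, v2, v3}.
example = {
    "u1": {"v1", "v2"},
    "u2": {"v1", "v2", "v3"},
    "u3": {"v2", "v3"},
}
print(count_pq_bicliques(example, 2, 2))  # prints 2
```

In the toy graph the two $(2, 2)$-bicliques are $(\{u1, u2\}, \{v1, v2\})$ and $(\{u2, u3\}, \{v2, v3\})$. Running the same routine on the transposed adjacency (anchoring on $V$) gives the same count at a different cost, which is the choice the abstract's cost model is meant to inform.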





Source journal

VLDB Journal (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.30

Self-citation rate: 4.80%

Review time: >12 weeks

Journal description: The journal is dedicated to the publication of scholarly contributions in areas of data management such as database system technology and information systems, including their architectures and applications. Further, the journal's scope is restricted to areas of data management that are covered by the combined expertise of the journal's editorial board. Submissions with a substantial theory component are welcome, but the VLDB Journal expects such submissions also to embody a systems component. In relation to data mining, the journal will handle submissions where systems issues play a significant role. Factors that we use to determine whether a data mining paper is within scope include: (1) the submission targets systems issues in relation to data mining, e.g., by covering integration with a database engine or with other data management functionality; (2) the submission's contributions build on (rather than simply cite) work already published in database outlets, e.g., VLDBJ, ACM TODS, PVLDB, ACM SIGMOD, IEEE ICDE, EDBT; (3) the journal's editorial board has the necessary expertise on the submission's topic. Traditional, stand-alone data mining papers that lack the above or similar characteristics are out of scope for this journal. Criteria similar to the above are applied to submissions from other areas, e.g., information retrieval and geographical information systems.