TreeSpan: efficiently computing similarity all-matching

Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. Pub Date: 2012-05-20. DOI: 10.1145/2213836.2213896

Gaoping Zhu, Xuemin Lin, Ke Zhu, W. Zhang, J. Yu


Citations: 50

Abstract

Given a query graph q and a data graph G, computing all occurrences of q in G, namely exact all-matching, is fundamental in graph data analysis and has a wide spectrum of real applications. It is challenging because even finding a single occurrence of q in G (the subgraph isomorphism test) is NP-complete. In many real applications, exploratory queries from users often express their real demands inaccurately. In this paper, we therefore study the problem of efficiently computing all approximate occurrences of q in G. In particular, we study the problem of efficiently retrieving all matches of q in G in which the number of missing edges is bounded by a given threshold θ, namely similarity all-matching. Similarity all-matching is at least as hard as exact all-matching, since it covers exact all-matching as the special case θ = 0. We develop a novel paradigm for similarity all-matching. Specifically, we propose to use a minimal set QT of spanning trees of q to cover all connected subgraphs q' of q missing at most θ edges; that is, each q' is spanned by a spanning tree in QT. We then conduct exact all-matching for each spanning tree in QT to induce all similarity matches. A rigorous theoretical analysis shows that our new search paradigm significantly reduces the number of exact all-matching runs compared with existing techniques. To further speed up the computation, we develop new filtering, computation sharing, and search ordering techniques. Our comprehensive experiments on both real and synthetic datasets demonstrate that our techniques outperform the state-of-the-art technique by 7 orders of magnitude.
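To make the problem statement concrete, the following is a minimal brute-force sketch of similarity all-matching on tiny graphs. It is not the TreeSpan algorithm: it simply enumerates every injective vertex mapping and keeps those that miss at most θ query edges while the matched edges still form a connected subgraph q' spanning q. The function names, the unlabeled undirected graph model, and the example graphs are all assumptions made for illustration; the paper's own definitions (for example, vertex labels) may differ.

```python
from itertools import permutations


def is_connected(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is connected."""
    vertices = list(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Depth-first search from an arbitrary start vertex.
    seen, stack = {vertices[0]}, [vertices[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(vertices)


def similarity_matches(q_vertices, q_edges, g_vertices, g_edges, theta):
    """Hypothetical brute-force baseline (exponential time, small inputs only).

    Returns every injective mapping of q's vertices into G such that at most
    `theta` edges of q are missing in G and the matched edges keep q connected.
    """
    g_edge_set = {frozenset(e) for e in g_edges}
    results = []
    for image in permutations(g_vertices, len(q_vertices)):
        m = dict(zip(q_vertices, image))
        # Query edges whose images are actual edges of G.
        kept = [(u, v) for u, v in q_edges
                if frozenset((m[u], m[v])) in g_edge_set]
        missing = len(q_edges) - len(kept)
        if missing <= theta and is_connected(q_vertices, kept):
            results.append(m)
    return results


# Illustrative inputs (assumed, not from the paper):
# q is a triangle, G is the path 1-2-3-4.
Q_V = ["a", "b", "c"]
Q_E = [("a", "b"), ("b", "c"), ("a", "c")]
G_V = [1, 2, 3, 4]
G_E = [(1, 2), (2, 3), (3, 4)]
```

With these inputs, θ = 0 yields no matches because G contains no triangle, while θ = 1 yields 12 mappings (six orderings each onto the vertex sets {1, 2, 3} and {2, 3, 4}). Raising θ to 2 adds nothing here, since a triangle with two edges missing is no longer connected. The paradigm described in the abstract avoids this exhaustive enumeration by running exact all-matching only on a small set of spanning trees of q.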

