GraphRank: Statistical Modeling and Mining of Significant Subgraphs in the Feature Space

Sixth International Conference on Data Mining (ICDM'06). Publication date: 2006-12-18. DOI: 10.1109/ICDM.2006.79

Huahai He, Ambuj K. Singh



Abstract

We propose a technique for evaluating the statistical significance of frequent subgraphs in a database. A graph is represented by a feature vector that is a histogram over a set of basis elements. The set of basis elements is chosen based on domain knowledge and consists generally of vertices, edges, or small graphs. A given subgraph is transformed to a feature vector and the significance of the subgraph is computed by considering the significance of occurrence of the corresponding vector. The probability of occurrence of the vector in a random vector is computed based on the prior probability of the basis elements. This is then used to obtain a probability distribution on the support of the vector in a database of random vectors. The statistical significance of the vector/subgraph is then defined as the p-value of its observed support. We develop efficient methods for computing p-values and lower bounds. A simplified model is further proposed to improve the efficiency. We also address the problem of feature vector mining, a generalization of itemset mining where counts are associated with items and the goal is to find significant sub-vectors. We present an algorithm that explores closed frequent sub-vectors to find significant ones. Experimental results show that the proposed techniques are effective, efficient, and useful for ranking frequent subgraphs by their statistical significance.
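The pipeline described in the abstract (histogram feature vector, probability of occurrence in a random vector, distribution of the support, p-value) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: basis elements are treated as independent, the priors are supplied as per-element tail probabilities `tail_probs[i][k] = P(Y_i >= k)`, and the support in a database of `m` random vectors is taken to be binomial. All function and parameter names are hypothetical.

```python
from collections import Counter
from math import comb


def feature_vector(elements, basis):
    """Histogram of a (sub)graph over the basis elements.

    `elements` lists the basis elements found in the graph, e.g. edge types.
    """
    counts = Counter(elements)
    return [counts[b] for b in basis]


def occurrence_probability(x, tail_probs):
    """P(a random vector contains x), ASSUMING independent basis elements.

    tail_probs[i][k] is the assumed prior P(Y_i >= k) for basis element i.
    """
    p = 1.0
    for i, count in enumerate(x):
        p *= tail_probs[i][count]
    return p


def binomial_p_value(p, m, observed_support):
    """P(support >= observed_support), ASSUMING support ~ Binomial(m, p)."""
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(observed_support, m + 1)
    )


# Illustrative inputs only; these are not data from the paper.
basis = ["C-C", "C-O"]
x = feature_vector(["C-C", "C-O", "C-C"], basis)  # two C-C edges, one C-O edge
tail_probs = [[1.0, 0.5, 0.2], [1.0, 0.3]]
p = occurrence_probability(x, tail_probs)
p_value = binomial_p_value(p, m=10, observed_support=3)
```

A smaller p-value means the observed support is less likely under the random model, so ranking subgraphs by increasing p-value puts the most significant ones first.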


