信息网络中社区异常值及其有效检测

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2010-07-25 DOI:10.1145/1835804.1835907

Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, Jiawei Han

{"title":"信息网络中社区异常值及其有效检测","authors":"Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, Jiawei Han","doi":"10.1145/1835804.1835907","DOIUrl":null,"url":null,"abstract":"Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called \"information networks\"), closely related objects that share the same properties or interests form a community. For example, a community in blogsphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars for a more positive sense), and then shows that well-known baseline approaches without considering links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both synthetic data and DBLP data sets, and the results demonstrate importance of this concept, as well as the effectiveness and efficiency of the proposed approach.","PeriodicalId":20529,"journal":{"name":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"22 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2010-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"267","resultStr":"{\"title\":\"On community outliers and their efficient detection in information networks\",\"authors\":\"Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, Jiawei Han\",\"doi\":\"10.1145/1835804.1835907\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called \\\"information networks\\\"), closely related objects that share the same properties or interests form a community. For example, a community in blogsphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars for a more positive sense), and then shows that well-known baseline approaches without considering links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both synthetic data and DBLP data sets, and the results demonstrate importance of this concept, as well as the effectiveness and efficiency of the proposed approach.\",\"PeriodicalId\":20529,\"journal\":{\"name\":\"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"volume\":\"22 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"267\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1835804.1835907\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1835804.1835907","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 267

摘要

链接或网络数据在许多应用程序中无处不在。例子包括通过超链接连接的网络数据或超文本文档，通过朋友链接连接的社交网络或用户档案，合著者和引文信息，博客数据，电影评论等等。在这些数据集(称为“信息网络”)中，具有相同属性或兴趣的密切相关的对象形成了一个社区。例如，博客圈中的社区可能是对手机评论和新闻最感兴趣的用户。信息网络中的离群点检测可以揭示出在忽略群体信息的情况下不明显的重要异常和有趣行为。一个例子是，一个低收入的人与许多富人交朋友，即使他的收入与整个人口相比并不是异常低。本文首先介绍了社区异常值的概念(有趣点或更积极意义上的新星)，然后表明，没有考虑链接或社区信息的已知基线方法无法找到这些社区异常值。我们提出了一种有效的解决方案，将网络数据建模为由多个正常群落和一组随机生成的异常值组成的混合模型。该概率模型基于隐马尔可夫随机场(HMRF)定义数据和链路的联合分布，同时表征数据和链路。利用数据的似然性和模型的后验性最大化来解决离群值推理问题。我们将该模型应用于合成数据和DBLP数据集，结果表明了这一概念的重要性，以及所提出方法的有效性和效率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

On community outliers and their efficient detection in information networks

Linked or networked data are ubiquitous in many applications. Examples include web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation information, blog data, movie reviews and so on. In these datasets (called "information networks"), closely related objects that share the same properties or interests form a community. For example, a community in blogsphere could be users mostly interested in cell phone reviews and news. Outlier detection in information networks can reveal important anomalous and interesting behaviors that are not obvious if community information is ignored. An example could be a low-income person being friends with many rich people even though his income is not anomalously low when considered over the entire population. This paper first introduces the concept of community outliers (interesting points or rising stars for a more positive sense), and then shows that well-known baseline approaches without considering links or community information cannot find these community outliers. We propose an efficient solution by modeling networked data as a mixture model composed of multiple normal communities and a set of randomly generated outliers. The probabilistic model characterizes both data and links simultaneously by defining their joint distribution based on hidden Markov random fields (HMRF). Maximizing the data likelihood and the posterior of the model gives the solution to the outlier inference problem. We apply the model on both synthetic data and DBLP data sets, and the results demonstrate importance of this concept, as well as the effectiveness and efficiency of the proposed approach.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining

自引率

0.00%

发文量