将流行度纳入社会网络分析的主题模型

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval Pub Date : 2013-07-28 DOI:10.1145/2484028.2484086

Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho

{"title":"将流行度纳入社会网络分析的主题模型","authors":"Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho","doi":"10.1145/2484028.2484086","DOIUrl":null,"url":null,"abstract":"Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in \"irrelevant\" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this \"popularity problem\", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a \"popularity component\" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"Incorporating popularity in topic models for social network analysis\",\"authors\":\"Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho\",\"doi\":\"10.1145/2484028.2484086\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in \\\"irrelevant\\\" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this \\\"popularity problem\\\", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a \\\"popularity component\\\" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.\",\"PeriodicalId\":178818,\"journal\":{\"name\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2484028.2484086\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484028.2484086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 41

摘要

主题模型用于将文本数据集中的单词分组为一组相关主题。不幸的是，当一些单词频繁出现在数据集中时，由主题模型识别的主题组变得嘈杂，因为这些频繁出现的单词反复出现在“不相关”的主题组中。这种噪声在文本数据集中并不是一个严重的问题，因为频繁的单词(例如，and is)没有太多的意义，在主题模型分析之前已经被简单地删除了。然而，在我们感兴趣的社交网络数据集中，他们对应于受欢迎的人(例如，巴拉克·奥巴马和贾斯汀·比伯)，不能简单地删除，因为大多数人对他们感兴趣。为了解决这个“流行度问题”，我们在主题模型中显式地对节点(词)的流行度进行建模。为此，我们首先引入“流行度组件”的概念，并提出有效容纳流行度组件的主题模型扩展。我们用真实的Twitter数据集来评估模型的有效性。与最先进的基线相比，我们提出的模型显着降低了困惑度(即更好的预测能力)。除了高入边度节点的受欢迎程度问题外，我们还通过另一个主题模型扩展研究了出边度的影响。结果表明，考虑向外边缘度对实现较低的困惑度没有太大帮助。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Incorporating popularity in topic models for social network analysis

Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in "irrelevant" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this "popularity problem", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a "popularity component" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

自引率

0.00%

发文量