Topic models with power-law using Pitman-Yor process

Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2010-07-25. DOI: 10.1145/1835804.1835890

Issei Sato, Hiroshi Nakagawa

Citations: 75

Abstract

One important approach to knowledge discovery and data mining is to estimate unobserved variables, because latent variables can indicate hidden properties of the observed data. The latent factor model assumes that each item in a record has a latent factor; the co-occurrence of items can then be modeled by latent factors. In document modeling, a record indicates a document represented as a "bag of words" (meaning that word order is ignored), an item indicates a word, and a latent factor indicates a topic. Latent Dirichlet allocation (LDA) is a widely used Bayesian topic model that places a Dirichlet distribution over the latent topic distribution of a document having multiple topics. LDA assumes that latent topics, i.e., discrete latent variables, are distributed according to a multinomial distribution whose parameters are generated from the Dirichlet distribution. LDA also models a word distribution by using a multinomial distribution whose parameters follow the Dirichlet distribution. This Dirichlet-multinomial setting, however, cannot capture the power-law phenomenon of a word distribution, which is known as Zipf's law in linguistics. We therefore propose a novel topic model using the Pitman-Yor (PY) process, called the PY topic model. The PY topic model captures two properties of a document: a power-law word distribution and the presence of multiple topics. In an experiment using real data, this model outperformed LDA in document modeling in terms of perplexity.
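The abstract does not give the model's equations, so the following is only an illustrative sketch of why a Pitman-Yor process can produce power-law behavior. It is not the authors' PY topic model. It simulates the standard Chinese-restaurant-process view of the Pitman-Yor process, in which a new customer joins existing table k with weight (size_k - discount) and opens a new table with weight (concentration + discount * K), where K is the current number of tables. The function name `pitman_yor_crp` and the parameter names `discount` and `concentration` are assumptions chosen for this example. Setting `discount=0` recovers the Dirichlet-process case.

```python
import random


def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers one by one and return the list of table sizes.

    Hypothetical helper for illustration only (not from the paper).
    discount=0 gives the ordinary Dirichlet-process restaurant.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must be greater than -discount")

    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        if not table_sizes:
            # The first customer always opens the first table.
            table_sizes.append(1)
            continue

        k = len(table_sizes)
        # Weights sum to (n + concentration): n customers are already seated.
        r = rng.random() * (n + concentration)
        new_table_weight = concentration + discount * k
        if r < new_table_weight:
            table_sizes.append(1)
            continue

        r -= new_table_weight
        for i, size in enumerate(table_sizes):
            r -= size - discount
            if r < 0:
                table_sizes[i] += 1
                break
        else:
            # Guard against floating-point round-off at the upper boundary.
            table_sizes[-1] += 1
    return table_sizes


if __name__ == "__main__":
    for d in (0.0, 0.5, 0.8):
        sizes = pitman_yor_crp(2000, discount=d, concentration=1.0, seed=1)
        print(f"discount={d}: {len(sizes)} tables, largest={max(sizes)}")
```

With a positive discount, the number of tables grows polynomially in the number of customers rather than logarithmically, and the table sizes become heavy-tailed. In a word-distribution setting this corresponds to a long tail of rare word types, which is the Zipf-like behavior the abstract says a Dirichlet-multinomial setting misses.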
