Improving LDA topic models for microblogs via tweet pooling and automatic labeling

Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pub Date: 2013-07-28
DOI: 10.1145/2484028.2484166

Rishabh Mehrotra, S. Sanner, Wray L. Buntine, Lexing Xie


Citations: 474

Abstract

Twitter, or the world of 140 characters, poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, the topics they learn are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
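To make the idea of pooling as a preprocessing step concrete, the following is a minimal sketch of hashtag pooling, not the authors' implementation. The function name `pool_by_hashtag`, the hashtag regular expression, the lowercasing of tags, and the choice to leave untagged tweets as separate single-tweet documents are all assumptions made for illustration; the abstract does not specify these details.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Aggregate tweets into one pseudo-document per hashtag.

    Returns a pair (pooled_docs, unpooled):
      pooled_docs maps a lowercased hashtag to the concatenation of all
      tweets that contain it; unpooled lists tweets with no hashtag.
    A tweet with several hashtags is added to each matching pool.
    """
    pools = defaultdict(list)
    unpooled = []
    for text in tweets:
        # Lowercase so that '#LDA' and '#lda' share a pool (an assumption).
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            for tag in sorted(tags):
                pools[tag].append(text)
        else:
            unpooled.append(text)
    pooled_docs = {tag: " ".join(texts) for tag, texts in pools.items()}
    return pooled_docs, unpooled


if __name__ == "__main__":
    sample = [
        "Great talk on #LDA today",
        "Topic models for short text #lda #nlp",
        "Just had coffee",
    ]
    pooled, rest = pool_by_hashtag(sample)
    for tag, doc in pooled.items():
        print(tag, "->", doc)
    print("unpooled:", rest)
```

The pooled documents, together with any unpooled tweets, would then be tokenized and passed to an unmodified LDA implementation in place of the individual tweets, which is what lets the approach leave the LDA machinery itself untouched.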

