Topic Modeling for Short Texts with Co-occurrence Frequency-Based Expansion

2016 5th Brazilian Conference on Intelligent Systems (BRACIS) Pub Date : 2016-10-01 DOI:10.1109/BRACIS.2016.058

Gabriel Pedrosa, Marcelo Pita, Paulo Viana Bicalho, A. Lacerda, G. Pappa


Citations: 15

Abstract

Short texts, such as social media posts and status messages, are everywhere on the Web, and extracting semantically meaningful topics from these collections is an important and difficult task. Topic modeling methods, such as Latent Dirichlet Allocation, were designed for this purpose. However, discovering high-quality topics in short text collections is challenging because most topic modeling methods rely on the word co-occurrence distribution of the collection to extract topics. Because this information is scarce in short texts, these methods struggle in this scenario, and different strategies to tackle the problem have been proposed in the literature. In this direction, this paper introduces a method for topic modeling of short texts that creates pseudo-document representations from the original documents. The method is simple and effective, and it uses word co-occurrence to expand documents; the expanded documents can be given as input to any topic modeling algorithm. Experiments were run on four datasets, and the method was compared against state-of-the-art methods for extracting topics from short text. Results for coherence, NPMI and clustering metrics were statistically significantly better than the baselines in the majority of cases.
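The abstract does not spell out the expansion rule, so the following is only a minimal Python sketch of the general idea of co-occurrence frequency-based expansion, not the authors' algorithm. It assumes document-level co-occurrence counts and appends to each short document the words that most often co-occur with the words it already contains. The function names `build_cooccurrence` and `expand` and the parameter `top_k` are hypothetical, introduced here for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count, for every pair of distinct words, the documents containing both."""
    cooc = defaultdict(Counter)
    for tokens in docs:
        # Use the set of tokens so a pair counts once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand(tokens, cooc, top_k=2):
    """Return a pseudo-document: the original tokens plus up to top_k
    words that co-occur most frequently with them (an assumed rule)."""
    present = set(tokens)
    scores = Counter()
    for word in present:
        for other, count in cooc.get(word, {}).items():
            if other not in present:
                scores[other] += count
    # Highest count first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(tokens) + [word for word, _ in ranked[:top_k]]


if __name__ == "__main__":
    corpus = [
        ["cat", "dog"],
        ["cat", "dog", "pet"],
        ["dog", "pet"],
        ["stock", "market"],
    ]
    counts = build_cooccurrence(corpus)
    pseudo_docs = [expand(doc, counts) for doc in corpus]
    for doc in pseudo_docs:
        print(doc)
```

The resulting pseudo-documents are ordinary token lists, so they could be passed to any topic modeling algorithm in place of the original short texts, which is the property the abstract emphasizes.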


