Topic Modeling for Short Texts with Co-occurrence Frequency-Based Expansion

Gabriel Pedrosa, Marcelo Pita, Paulo Viana Bicalho, A. Lacerda, G. Pappa
Published in: 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016-10-01
DOI: 10.1109/BRACIS.2016.058 (https://doi.org/10.1109/BRACIS.2016.058)
Citations: 15

Abstract

Short texts are everywhere on the Web, including social media messages and status messages, and extracting semantically meaningful topics from these collections is an important and difficult task. Topic modeling methods, such as Latent Dirichlet Allocation, were designed for this purpose. However, discovering high-quality topics in short text collections is challenging, because most topic modeling methods rely on the word co-occurrence distribution in the collection to extract topics. In short texts this information is scarce, so topic modeling methods struggle, and different strategies to tackle the problem have been proposed in the literature. In this direction, this paper introduces a method for topic modeling of short texts that creates pseudo-document representations from the original documents. The method is simple and effective, and it uses word co-occurrence to expand documents; the expanded documents can be given as input to any topic modeling algorithm. Experiments were run on four datasets and compared against state-of-the-art methods for extracting topics from short text. Results for coherence, NPMI, and clustering metrics were statistically significantly better than the baselines in the majority of cases.
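
The abstract does not spell out the expansion rule, so the following is only a minimal sketch of one plausible reading of co-occurrence frequency-based expansion, not the authors' algorithm. It assumes document-level co-occurrence counts and appends, for each word in a short document, the words it most often co-occurs with across the collection. The names `build_cooccurrence`, `expand`, and `top_k`, and the toy documents, are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """Count, for every word, how many documents it shares with each other word."""
    cooc = defaultdict(Counter)
    for tokens in docs:
        # Use the set of tokens so a pair is counted once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand(tokens, cooc, top_k=2):
    """Build a pseudo-document: the original tokens plus, for each token,
    its top_k most frequently co-occurring words (an assumed rule)."""
    expanded = list(tokens)
    for word in tokens:
        neighbours = cooc.get(word, Counter())
        # Sort by descending count, breaking ties alphabetically for determinism.
        ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
        expanded.extend(w for w, _ in ranked[:top_k])
    return expanded


# Toy collection of tokenized short texts (hypothetical data).
docs = [
    ["apple", "iphone", "launch"],
    ["apple", "iphone", "price"],
    ["football", "match", "goal"],
    ["football", "goal", "score"],
]
cooc = build_cooccurrence(docs)
pseudo_docs = [expand(d, cooc, top_k=1) for d in docs]
```

The resulting `pseudo_docs` are ordinary token lists, so under this reading they could be passed to any topic model (for example an LDA implementation) in place of the original short documents, which matches the abstract's claim that the expansion is independent of the topic modeling algorithm.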