Topic Detection from Microblogs Using T-LDA and Perplexity

Ling Huang, Jinyu Ma, Chunling Chen
{"title":"Topic Detection from Microblogs Using T-LDA and Perplexity","authors":"Ling Huang, Jinyu Ma, Chunling Chen","doi":"10.1109/APSECW.2017.11","DOIUrl":null,"url":null,"abstract":"Due to the short-form and large amount of microblogs, traditional latent dirichlet allocation (LDA) cannot be effectively applied to mining topics from the microblog contents. In this paper, we bring in Term Frequency-Inverse Document Frequency (TF-IDF) that can adjust the weight of words and calculate in a high speed without considering the influence of word positions in documents, to help extract the key words in a relatively short length of article. Combining LDA with TF-IDF, we come up with a new topic detection method named T-LDA. In addition, we utilize Perplexity-K curve to help us recognize the number of topics (i.e. K-value) with the maximum meaningfulness, in order to reduce human bias in deciding K-value. We captured 3407 Chinese microblogs, chose the most optimistic K-value according to Perplexity-K curve, and conducted a series comparative trials among T-LDA, LDA and K-Means. 
We found that T-LDA has a better performance than LDA and K-Means in terms of topics results, modeling time, Precision, Recall rate, and F-Measure, which indicates that the improvement on LDA is effective.","PeriodicalId":172357,"journal":{"name":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSECW.2017.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 34

Abstract

Because microblogs are short and extremely numerous, traditional latent Dirichlet allocation (LDA) cannot be applied effectively to mining topics from microblog content. In this paper, we introduce Term Frequency-Inverse Document Frequency (TF-IDF), which adjusts word weights and computes quickly without considering word positions in documents, to help extract keywords from relatively short texts. Combining LDA with TF-IDF, we propose a new topic detection method named T-LDA. In addition, we use the Perplexity-K curve to identify the number of topics (the K-value) with the greatest meaningfulness, reducing human bias in choosing K. We collected 3,407 Chinese microblogs, chose the optimal K-value according to the Perplexity-K curve, and conducted a series of comparative trials among T-LDA, LDA, and K-Means. We found that T-LDA outperforms LDA and K-Means in terms of topic quality, modeling time, precision, recall, and F-measure, which indicates that the improvement on LDA is effective.
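The pipeline the abstract describes (TF-IDF weighting of terms, LDA over the weighted corpus, and a perplexity sweep over candidate K-values) can be sketched roughly as follows. This is a minimal illustration using scikit-learn as a stand-in, not the authors' implementation: the toy documents, the candidate K range, and the use of `TfidfVectorizer` with `LatentDirichletAllocation` are all assumptions made for the example.

```python
# Hedged sketch of the T-LDA idea from the abstract: TF-IDF re-weights
# terms so informative words dominate short texts, then LDA is fit for
# several candidate K values and the Perplexity-K curve guides the choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # toy stand-in for the 3,407 microblogs used in the paper
    "stock market prices rise on trade news",
    "market investors watch stock prices closely",
    "new phone camera and battery reviewed",
    "smartphone review praises camera battery life",
    "football team wins league championship match",
    "championship match ends with late football goal",
]

# TF-IDF term weighting (ignores word positions, fast to compute).
X = TfidfVectorizer().fit_transform(docs)

# Sweep candidate topic counts and record model perplexity on the corpus;
# the paper selects K near the low point of the Perplexity-K curve.
perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X)
    perplexities[k] = lda.perplexity(X)

best_k = min(perplexities, key=perplexities.get)
print(best_k, perplexities)
```

On a real corpus the sweep would cover a wider K range, and the curve would typically be plotted to inspect where perplexity stops improving meaningfully.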