{"title":"基于T-LDA和Perplexity的微博话题检测","authors":"Ling Huang, Jinyu Ma, Chunling Chen","doi":"10.1109/APSECW.2017.11","DOIUrl":null,"url":null,"abstract":"Due to the short-form and large amount of microblogs, traditional latent dirichlet allocation (LDA) cannot be effectively applied to mining topics from the microblog contents. In this paper, we bring in Term Frequency-Inverse Document Frequency (TF-IDF) that can adjust the weight of words and calculate in a high speed without considering the influence of word positions in documents, to help extract the key words in a relatively short length of article. Combining LDA with TF-IDF, we come up with a new topic detection method named T-LDA. In addition, we utilize Perplexity-K curve to help us recognize the number of topics (i.e. K-value) with the maximum meaningfulness, in order to reduce human bias in deciding K-value. We captured 3407 Chinese microblogs, chose the most optimistic K-value according to Perplexity-K curve, and conducted a series comparative trials among T-LDA, LDA and K-Means. We found that T-LDA has a better performance than LDA and K-Means in terms of topics results, modeling time, Precision, Recall rate, and F-Measure, which indicates that the improvement on LDA is effective.","PeriodicalId":172357,"journal":{"name":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"Topic Detection from Microblogs Using T-LDA and Perplexity\",\"authors\":\"Ling Huang, Jinyu Ma, Chunling Chen\",\"doi\":\"10.1109/APSECW.2017.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the short-form and large amount of microblogs, traditional latent dirichlet allocation (LDA) cannot be effectively applied to mining topics from the microblog contents. In this paper, we bring in Term Frequency-Inverse Document Frequency (TF-IDF) that can adjust the weight of words and calculate in a high speed without considering the influence of word positions in documents, to help extract the key words in a relatively short length of article. Combining LDA with TF-IDF, we come up with a new topic detection method named T-LDA. In addition, we utilize Perplexity-K curve to help us recognize the number of topics (i.e. K-value) with the maximum meaningfulness, in order to reduce human bias in deciding K-value. We captured 3407 Chinese microblogs, chose the most optimistic K-value according to Perplexity-K curve, and conducted a series comparative trials among T-LDA, LDA and K-Means. 
We found that T-LDA has a better performance than LDA and K-Means in terms of topics results, modeling time, Precision, Recall rate, and F-Measure, which indicates that the improvement on LDA is effective.\",\"PeriodicalId\":172357,\"journal\":{\"name\":\"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/APSECW.2017.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSECW.2017.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Topic Detection from Microblogs Using T-LDA and Perplexity
Because microblogs are short and extremely numerous, traditional Latent Dirichlet Allocation (LDA) cannot be applied effectively to mining topics from microblog content. In this paper, we introduce Term Frequency-Inverse Document Frequency (TF-IDF), which adjusts the weights of words and can be computed quickly without considering word positions in documents, to help extract keywords from relatively short texts. Combining LDA with TF-IDF, we propose a new topic detection method named T-LDA. In addition, we use the Perplexity-K curve to identify the most meaningful number of topics (i.e., the K-value), in order to reduce human bias in choosing K. We collected 3407 Chinese microblogs, chose the optimal K-value according to the Perplexity-K curve, and conducted a series of comparative trials among T-LDA, LDA, and K-Means. We found that T-LDA outperforms LDA and K-Means in terms of topic results, modeling time, precision, recall, and F-measure, which indicates that the improvement on LDA is effective.
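The abstract describes a general workflow: score terms with TF-IDF, feed the resulting keyword vocabulary to LDA, and pick the topic number K from a Perplexity-K curve rather than by hand. The sketch below is not the authors' T-LDA implementation; it is a minimal illustration of that workflow using scikit-learn, with a placeholder corpus and arbitrary cut-offs (top-1000 terms, K from 2 to 5) standing in for the paper's actual settings.

```python
# Minimal sketch of the TF-IDF + LDA + Perplexity-K workflow (not the paper's T-LDA code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "example microblog text one",
    "another short microblog about a different topic",
    "more short text for the toy corpus",
]  # placeholder for the tokenised microblog corpus

# 1. TF-IDF: score terms across the corpus and keep only the highest-weighted
#    ones as keywords, so LDA works on a vocabulary of salient words.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
mean_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
terms = np.array(tfidf.get_feature_names_out())
top_terms = terms[np.argsort(mean_scores)[::-1][:1000]]  # arbitrary cut-off

# 2. Build a count matrix restricted to the TF-IDF keywords and fit LDA
#    for several candidate topic numbers K.
counts = CountVectorizer(vocabulary=top_terms).fit_transform(docs)
perplexities = {}
for k in range(2, 6):  # a real corpus would use a wider K range
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(counts)
    perplexities[k] = lda.perplexity(counts)

# 3. Perplexity-K curve: choose the K with the lowest perplexity
#    (or where the curve flattens) instead of setting K manually.
best_k = min(perplexities, key=perplexities.get)
print(perplexities, "chosen K:", best_k)
```

In practice the perplexity values would be plotted against K and inspected for an elbow; the simple argmin above is just the most compact way to express the selection step.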