Topic Detection from Microblogs Using T-LDA and Perplexity
Authors: Ling Huang, Jinyu Ma, Chunling Chen
Venue: 2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)
Publication date: 2017-12-01
DOI: 10.1109/APSECW.2017.11 (https://doi.org/10.1109/APSECW.2017.11)
Citations: 34
Abstract
Because microblogs are short and numerous, traditional Latent Dirichlet Allocation (LDA) cannot be applied effectively to mining topics from microblog content. In this paper, we introduce Term Frequency-Inverse Document Frequency (TF-IDF), which adjusts word weights and is fast to compute because it ignores word positions within documents, to help extract keywords from relatively short texts. Combining LDA with TF-IDF, we propose a new topic detection method named T-LDA. In addition, we use the Perplexity-K curve to identify the most meaningful number of topics (i.e., the K-value), reducing human bias in choosing K. We collected 3407 Chinese microblogs, chose the optimal K-value according to the Perplexity-K curve, and conducted a series of comparative trials among T-LDA, LDA, and K-Means. We found that T-LDA outperforms LDA and K-Means in terms of topic results, modeling time, precision, recall, and F-measure, which indicates that the improvement on LDA is effective.
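The two building blocks named in the abstract, TF-IDF weighting and a perplexity-versus-K curve, can be illustrated with a minimal Python sketch. This is not the authors' T-LDA implementation: the abstract does not say how the TF-IDF weights are fed into LDA, so that step is omitted. The function names (`tfidf`, `perplexity`, `pick_k`), the plain `tf * log(N / df)` weighting, and the rule of taking the K with the lowest perplexity are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Assumed weighting: (term count / doc length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def perplexity(total_log_likelihood, total_tokens):
    """Perplexity = exp(-log-likelihood per token); lower means a better fit."""
    return math.exp(-total_log_likelihood / total_tokens)


def pick_k(perplexity_by_k):
    """Assumed selection rule: the K with the lowest perplexity on the curve."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

In this sketch, a term that appears in every document receives weight zero, which is how TF-IDF suppresses uninformative words in short texts. The perplexity values passed to `pick_k` would come from fitting a topic model once per candidate K; those fits are outside the scope of the sketch.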