{"title":"基于T-LDA和Perplexity的微博话题检测","authors":"Ling Huang, Jinyu Ma, Chunling Chen","doi":"10.1109/APSECW.2017.11","DOIUrl":null,"url":null,"abstract":"Due to the short-form and large amount of microblogs, traditional latent dirichlet allocation (LDA) cannot be effectively applied to mining topics from the microblog contents. In this paper, we bring in Term Frequency-Inverse Document Frequency (TF-IDF) that can adjust the weight of words and calculate in a high speed without considering the influence of word positions in documents, to help extract the key words in a relatively short length of article. Combining LDA with TF-IDF, we come up with a new topic detection method named T-LDA. In addition, we utilize Perplexity-K curve to help us recognize the number of topics (i.e. K-value) with the maximum meaningfulness, in order to reduce human bias in deciding K-value. We captured 3407 Chinese microblogs, chose the most optimistic K-value according to Perplexity-K curve, and conducted a series comparative trials among T-LDA, LDA and K-Means. We found that T-LDA has a better performance than LDA and K-Means in terms of topics results, modeling time, Precision, Recall rate, and F-Measure, which indicates that the improvement on LDA is effective.","PeriodicalId":172357,"journal":{"name":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"Topic Detection from Microblogs Using T-LDA and Perplexity\",\"authors\":\"Ling Huang, Jinyu Ma, Chunling Chen\",\"doi\":\"10.1109/APSECW.2017.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the short-form and large amount of microblogs, traditional latent dirichlet allocation (LDA) cannot be effectively applied to mining topics from the microblog contents. In this paper, we bring in Term Frequency-Inverse Document Frequency (TF-IDF) that can adjust the weight of words and calculate in a high speed without considering the influence of word positions in documents, to help extract the key words in a relatively short length of article. Combining LDA with TF-IDF, we come up with a new topic detection method named T-LDA. In addition, we utilize Perplexity-K curve to help us recognize the number of topics (i.e. K-value) with the maximum meaningfulness, in order to reduce human bias in deciding K-value. We captured 3407 Chinese microblogs, chose the most optimistic K-value according to Perplexity-K curve, and conducted a series comparative trials among T-LDA, LDA and K-Means. 
We found that T-LDA has a better performance than LDA and K-Means in terms of topics results, modeling time, Precision, Recall rate, and F-Measure, which indicates that the improvement on LDA is effective.\",\"PeriodicalId\":172357,\"journal\":{\"name\":\"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/APSECW.2017.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 24th Asia-Pacific Software Engineering Conference Workshops (APSECW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSECW.2017.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Topic Detection from Microblogs Using T-LDA and Perplexity
Because microblogs are short and extremely numerous, traditional Latent Dirichlet Allocation (LDA) cannot be applied effectively to mining topics from microblog content. In this paper, we introduce Term Frequency-Inverse Document Frequency (TF-IDF), which adjusts the weights of words and can be computed quickly without considering word positions in documents, to help extract keywords from relatively short texts. Combining LDA with TF-IDF, we propose a new topic detection method named T-LDA. In addition, we use the Perplexity-K curve to identify the most meaningful number of topics (i.e., the K-value), in order to reduce human bias in choosing K. We collected 3407 Chinese microblogs, chose the optimal K-value according to the Perplexity-K curve, and conducted a series of comparative trials among T-LDA, LDA, and K-Means. We found that T-LDA outperforms LDA and K-Means in terms of topic results, modeling time, precision, recall, and F-measure, which indicates that the improvement on LDA is effective.
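The abstract describes a general workflow: score terms with TF-IDF, feed the resulting keyword vocabulary to LDA, and pick the topic number K from a Perplexity-K curve rather than by hand. The sketch below is not the authors' T-LDA implementation; it is a minimal illustration of that workflow using scikit-learn, with a placeholder corpus and arbitrary cut-offs (top-1000 terms, K from 2 to 5) standing in for the paper's actual settings.

```python
# Minimal sketch of the TF-IDF + LDA + Perplexity-K workflow (not the paper's T-LDA code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "example microblog text one",
    "another short microblog about a different topic",
    "more short text for the toy corpus",
]  # placeholder for the tokenised microblog corpus

# 1. TF-IDF: score terms across the corpus and keep only the highest-weighted
#    ones as keywords, so LDA works on a vocabulary of salient words.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
mean_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
terms = np.array(tfidf.get_feature_names_out())
top_terms = terms[np.argsort(mean_scores)[::-1][:1000]]  # arbitrary cut-off

# 2. Build a count matrix restricted to the TF-IDF keywords and fit LDA
#    for several candidate topic numbers K.
counts = CountVectorizer(vocabulary=top_terms).fit_transform(docs)
perplexities = {}
for k in range(2, 6):  # a real corpus would use a wider K range
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(counts)
    perplexities[k] = lda.perplexity(counts)

# 3. Perplexity-K curve: choose the K with the lowest perplexity
#    (or where the curve flattens) instead of setting K manually.
best_k = min(perplexities, key=perplexities.get)
print(perplexities, "chosen K:", best_k)
```

In practice the perplexity values would be plotted against K and inspected for an elbow; the simple argmin above is just the most compact way to express the selection step.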