Finding High-Level Topics and Tweet Labeling Using Topic Models
Sameendra Samarawickrama, S. Karunasekera, A. Harwood
2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), published 2015-12-14
DOI: 10.1109/ICPADS.2015.38 (https://doi.org/10.1109/ICPADS.2015.38)
Citation count: 3
Abstract
Making sense of Twitter data streams is challenging due to the extremely high volume of data. One way to address this challenge is to treat these data streams as containing a set of high-level topics. In this research we address the following problem: given a collection of tweets about K high-level topics, how can we find topic words that describe these topics, and how can we label each tweet with one of the K topics, using a topic modeling approach? Current research has shown that applying topic modeling algorithms directly to tweets does not lead to good results. Hence, one approach is to group related tweets together to form a single “pseudo-document”, which is more informative than a single tweet. In this paper we evaluate different grouping schemes found in the literature and propose a new grouping scheme that uses named entities and word collocations. Results show that our proposed scheme performs better than the existing approaches, to some extent, in all test cases, for both finding high-level topics and tweet labeling.
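The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the scheme proposed in the paper, which uses named entities and word collocations. It assumes instead a simpler pooling key, shared hashtags, chosen only because it needs no external libraries. The function name `pool_tweets`, the key layout, and the sample tweets are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def pool_tweets(tweets):
    """Group tweets that share a hashtag into one pseudo-document per hashtag.

    Illustrative only: hashtag pooling stands in for the paper's
    named-entity and collocation based grouping. A tweet with several
    hashtags joins several pools; a tweet with none stays on its own.
    """
    pools = defaultdict(list)
    for index, text in enumerate(tweets):
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if not tags:
            # No grouping key: keep the tweet as a single-tweet document.
            pools[("tweet", index)].append(text)
        for tag in tags:
            pools[("tag", tag)].append(text)
    # Concatenate each pool into one longer pseudo-document.
    return {key: " ".join(texts) for key, texts in pools.items()}


# Hypothetical sample data.
tweets = [
    "Great goal tonight #football",
    "Transfer rumours again #Football",
    "New phone launch #tech",
    "Stuck in traffic",
]
pseudo_docs = pool_tweets(tweets)
```

Each pseudo-document would then be passed to a topic model in place of the individual tweets. Labeling a single tweet with one of the K topics would still require inferring that tweet's topic distribution from the trained model.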