Title: Evaluation of retweet clustering method classification method using retweets on Twitter without text data
Authors: Kazuki Uchida, F. Toriumi, Takeshi Sakaki
Venue: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
Published: 2017-08-23
DOI: 10.1145/3106426.3106451 (https://doi.org/10.1145/3106426.3106451)
Citations: 5
Abstract
Burst phenomena, which occur frequently on social media, are triggered by social events such as online flaming, elections, and natural disasters. To understand people's thoughts and feelings, we must classify the opinions expressed during burst phenomena, so methods that categorize tweets are critical. However, because most classification methods rely on text mining, they cannot group tweets by topic when the tweets share little linguistic similarity. We used a non-text-based classification method proposed by Baba et al. that groups tweets by topic even when they have little linguistic similarity, and verified its validity by comparing it with a text-based classification method in two evaluations: qualitative and quantitative. In the qualitative evaluation, we conducted a questionnaire survey to assess the suitability of the topic clusters created by the non-text-based and text-based methods. Because evaluating the similarity of every pair of tweets in each topic is difficult, we evaluated the similarity of sampled pairs in the survey; the non-text-based method produced more appropriate topic clusters than the text-based method. In the quantitative evaluation, we focused on the robustness of each method against data reduction. Many approaches analyze social media data, largely because collecting such data is comparatively easy. However, collecting the complete data of a burst phenomenon is very costly because of the vast amount of social media data available, so robustness against data reduction is an important index for evaluating classification methods. With the non-text-based method, over 55% of the pairs of tweets that belonged to the same cluster were still placed in the same cluster even when the data were reduced to 10%, in all three of our example cases. In this paper, we focus on Twitter, one of the most popular microblogging services, as the data source. Using clustering to conduct detailed case analyses, we scrutinized three burst cases, including natural disasters and online flaming, and found that the non-text-based method classified tweets in burst phenomena more effectively than the text-based method.
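
The robustness figure quoted in the abstract (the share of same-cluster tweet pairs that remain in the same cluster after data reduction) can be illustrated with a small sketch. This is only one plausible reading of that measure, not the authors' implementation: the abstract does not give the exact definition, and the function names `same_cluster_pairs` and `pair_retention`, the dictionary input format, and the toy labels are all assumptions made for illustration.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of unordered tweet-id pairs that share a cluster label."""
    by_cluster = {}
    for tweet_id, cluster in labels.items():
        by_cluster.setdefault(cluster, []).append(tweet_id)
    pairs = set()
    for members in by_cluster.values():
        # Sort so each unordered pair has one canonical (smaller, larger) form.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pair_retention(full_labels, reduced_labels):
    """Fraction of pairs co-clustered on the full data that are still
    co-clustered after reduction (assumed definition, see lead-in).

    Only tweets that appear in both clusterings are considered.
    """
    shared = set(full_labels) & set(reduced_labels)
    full_pairs = same_cluster_pairs({t: full_labels[t] for t in shared})
    if not full_pairs:
        return 0.0
    reduced_pairs = same_cluster_pairs({t: reduced_labels[t] for t in shared})
    return len(full_pairs & reduced_pairs) / len(full_pairs)


# Toy example: tweet id -> cluster label (labels need not match across runs).
full = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
reduced = {1: "x", 2: "x", 3: "y", 4: "z", 5: "z"}
score = pair_retention(full, reduced)
print(score)  # 0.5: pairs (1,2) and (4,5) survive out of four original pairs
```

Comparing pairs rather than cluster labels avoids having to match cluster identifiers between the full and the reduced run, since only co-membership matters.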