Evaluation of retweet clustering method classification method using retweets on Twitter without text data

Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, Pub Date: 2017-08-23, DOI: 10.1145/3106426.3106451

Kazuki Uchida, F. Toriumi, Takeshi Sakaki


Citations: 5

Abstract

Burst phenomena, which occur frequently on social media, are triggered by social events such as online flaming, elections, and natural disasters. To understand people's thoughts and feelings during such events, we must classify the opinions expressed in a burst, so methods that categorize tweets are critical. However, most classification methods rely on text mining and cannot group tweets by topic when the tweets have little linguistic similarity to one another. We used a non-text-based classification method proposed by Baba et al. that groups tweets by topic even when they have little linguistic similarity, and verified its validity by comparing it with a text-based classification method in two evaluations, one qualitative and one quantitative. In the qualitative evaluation, we conducted a questionnaire survey to assess the suitability of the topic clusters produced by the non-text-based and text-based methods. Because evaluating the similarity of every pair of tweets in each topic is difficult, the survey evaluated sampled pairs; the non-text-based method produced more appropriate topic clusters than the text-based method. In the quantitative evaluation, we focused on the robustness of each method against data reduction. Many studies analyze social media data, in part because collecting it is comparatively easy. However, collecting all of the data for a burst phenomenon is very costly because of its sheer volume, so robustness against data reduction is an important index for evaluating classification methods. With the non-text-based method, over 55% of the tweet pairs placed in the same cluster remained in the same cluster even when the data were reduced to 10%, in all three of our example cases. In this paper, we focus on Twitter, one of the most popular microblogging services, as the data source. Using clustering to conduct detailed case analyses, we scrutinized three burst cases, including natural disasters and online flaming, and found that the non-text-based method classified tweets in burst phenomena more effectively than the text-based method.
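The robustness figure reported in the abstract (the share of same-cluster tweet pairs that stay together after data reduction) can be illustrated with a minimal sketch. The abstract does not give the exact definition, so the following is an assumption: only pairs whose two tweets both appear in the reduced clustering are counted. The function names, tweet IDs, and cluster labels are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of unordered tweet-ID pairs that share a cluster label."""
    pairs = set()
    # Sorting the IDs gives every pair a canonical (smaller, larger) order.
    for a, b in combinations(sorted(labels), 2):
        if labels[a] == labels[b]:
            pairs.add((a, b))
    return pairs


def pair_preservation(full_labels, reduced_labels):
    """Fraction of pairs co-clustered in the full data that stay co-clustered.

    Assumption: only pairs whose two tweets both appear in the reduced
    clustering are counted; the paper may define the denominator differently.
    """
    surviving = {t: c for t, c in full_labels.items() if t in reduced_labels}
    baseline = same_cluster_pairs(surviving)
    if not baseline:
        return 0.0
    kept = baseline & same_cluster_pairs(reduced_labels)
    return len(kept) / len(baseline)


# Hypothetical toy data: tweet ID -> cluster label. Label names need not
# match between the two clusterings; only co-membership is compared.
full = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 1}
reduced = {"t1": "a", "t2": "a", "t3": "b", "t4": "c", "t5": "c"}
score = pair_preservation(full, reduced)
print(f"pair preservation: {score:.2f}")
```

Comparing co-membership rather than label values avoids having to align cluster IDs between the two runs. The pair enumeration is quadratic in the number of tweets, so a sampled or count-based variant would be needed for burst-scale data.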
