Social media analysis using optimized K-Means clustering

2016 IEEE 14th International Conference on Software Engineering Research, Management and Applications (SERA), Pub Date: 2016-06-08, DOI: 10.1109/SERA.2016.7516129

Ahmed Alsayat, H. El-Sayed


Citations: 7

Abstract

The growing influence of social media and the enormous participation of its users create new opportunities to study human social behavior and to analyze large volumes of streaming data. One interesting problem is distinguishing between different kinds of users, for example users who act as leaders and introduce new issues and discussions on social media. Positive or negative attitudes can also be inferred from those discussions. Such problems require a formal interpretation of social media logs and of the units of information that spread from person to person through the social network. Once social media data such as user messages have been parsed and the network relationships identified, data mining techniques can be applied to group different types of communities. However, existing methods rarely capture the appropriate granularity of user communities and their behavior. In this paper, we present a framework for the novel task of detecting communities by clustering messages from large streams of social data. Our framework combines the K-Means clustering algorithm with a genetic algorithm and the Optimized Cluster Distance (OCD) method. The goal of the proposed framework is twofold: to overcome the difficulty that standard K-Means has in choosing good initial centroids, by selecting them with the genetic algorithm, and to maximize the distance between clusters through pairwise clustering with OCD, yielding more accurate clusters. We used several cluster validation metrics to evaluate the performance of our algorithm. The analysis shows that the proposed method gives better clustering results and provides a novel use case: grouping user communities based on their activities. Our approach is optimized and scalable for real-time clustering of social media data.
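The idea of choosing initial K-Means centroids with a genetic algorithm can be illustrated with a small sketch. The abstract does not specify the chromosome encoding, fitness function, or genetic operators, and it does not define OCD, so everything below is an assumption made for illustration: individuals are sets of k data-point indices, fitness is the sum of squared errors (SSE), and the names `ga_init`, `kmeans`, and `sse` are hypothetical. OCD is not sketched, and this should not be read as the authors' implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def ga_init(points, k, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Assumed GA: evolve sets of k point indices, lower SSE is fitter."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(ind):
        return sse(points, [points[i] for i in ind])

    pop = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]          # elitism: keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            pool = list(set(a) | set(b))      # crossover: mix both parents' indices
            child = rng.sample(pool, k)
            if rng.random() < mutation_rate:  # mutation: swap in an unused index
                unused = [i for i in range(n) if i not in child]
                if unused:
                    child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        pop = elite + children
    best = min(pop, key=fitness)
    return [points[i] for i in best]

# Usage on synthetic 2-D data: three well-separated groups of 10 points each.
data_rng = random.Random(1)
points = [(cx + data_rng.uniform(-1, 1), cy + data_rng.uniform(-1, 1))
          for cx, cy in [(0, 0), (10, 10), (20, 0)] for _ in range(10)]
init = ga_init(points, k=3)
centroids, labels = kmeans(points, init)
print("SSE at GA-chosen start:", round(sse(points, init), 2))
print("SSE after K-Means:", round(sse(points, centroids), 2))
```

In this sketch the genetic search only picks the starting centroids and the Lloyd iterations then refine them. For real message streams the points would be feature vectors extracted from the messages, a step the abstract does not detail.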
