Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW) Pub Date : 2023-05-01 DOI:10.1109/CCGridW59191.2023.00039

Shubhangi Chaturvedi, S. Saritha, Animesh Chaturvedi



Abstract

The number of social media users is increasing, and the volume of data they produce is growing with it. Mining and analyzing social media data can uncover hidden information that is helpful in decision-making. Predicting co-occurring words, together with a confidence value, can provide deep insight into social media. This paper presents an applied process for mining social media datasets to retrieve frequent patterns (or rules) in a cost-effective amount of time. The retrieved patterns can be useful in making decisions related to social media. The experiments are performed on three social media datasets, and the resulting rules are analyzed while varying the thresholds (minimum support and minimum confidence). Experiments are performed for both Frequent Pattern (FP) Growth and Parallel FP (PFP) Growth on the same datasets, with the parallel computation achieved using a scalable Apache Spark environment. The execution times of FP-Growth and PFP-Growth on the same datasets are also reported. The experiments show that the SPMF implementation of FP-Growth requires preprocessing to convert item-sets into transactional databases. This preprocessing is required only once, so the time subsequently needed to generate rules is lower. PFP-Growth, in contrast, requires no preprocessing of the dataset, which saves time because the association rules can be generated directly.
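To make the two thresholds concrete, the following is a minimal sketch of how minimum support and minimum confidence filter co-occurring word rules. It is not the paper's method: it counts word pairs by brute force rather than building an FP-tree, it does not use Spark or SPMF, and the toy `posts` data and the function name `mine_pair_rules` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy "posts", already tokenized into word sets.
# This is illustrative data, not one of the paper's datasets.
posts = [
    {"spark", "data", "mining"},
    {"spark", "data"},
    {"data", "mining"},
    {"spark", "data", "social"},
]


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return rules x -> y over single words that pass both thresholds.

    Brute-force pair counting for illustration only (not FP-Growth).
    support(x, y)   = count(x and y together) / number of transactions
    confidence(x->y) = count(x and y together) / count(x)
    """
    n = len(transactions)
    item_counts = Counter(word for t in transactions for word in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is not frequent enough
        # Each frequent pair yields two candidate rules, one per direction.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


rules = mine_pair_rules(posts, min_support=0.5, min_confidence=0.8)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
```

Raising either threshold shrinks the rule set, and lowering it admits weaker rules, which is the kind of variation the experiments examine. FP-Growth and PFP-Growth avoid enumerating candidate pairs in this way, so the sketch only illustrates the role of the thresholds, not the cost of the mining step.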


