An efficient framework of data mining and its analytics on massive streams of big data repositories

2016 IEEE Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER). Pub Date: 2016-08-01. DOI: 10.1109/DISCOVER.2016.7806259

D. Disha, B. J. Sowmya, Chetan, S. Seema


Citations: 3

Abstract

Big Data consists of large, complex, and growing data sets drawn from several independent sources. With the rapid development of data collection and storage capacity, big data is expanding in all science and engineering domains. The most fundamental challenge for big data applications is to examine large amounts of data and extract the required information or knowledge for future use, a task that exceeds the storage and processing limits of relational databases. This paper treats Twitter as the big data repository: it dynamically mines recent tweets related to the movie Kapoor and Sons and performs data mining and analytics on them, addressing the challenges categorized under the HACE theorem. To handle the massive number of tweets, we use the Hadoop MapReduce framework to perform data mining and analytic operations such as data cleansing, data classification, and data clustering. A prediction model for the movie reviews is built using a Naive Bayes classifier, and the accuracy of the prediction is calculated with a binomial test, as it conforms to the Bernoulli distribution. Tweets are clustered on the basis of location and hashtags. Additionally, the privacy of user tweets is preserved using a data mining anomaly technique, and results are displayed with the help of intelligent graphs. As a performance evaluation of MapReduce, the predictive analysis is run both with and without MapReduce, and a performance graph based on the execution-time comparison is obtained to show that MapReduce is an efficient framework for huge volumes of data.
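Two of the steps named in the abstract, clustering tweets by hashtag in a map/reduce style and checking prediction accuracy with a binomial test, can be illustrated with a minimal sketch. This is not the authors' Hadoop implementation: it runs in a single Python process, the tweets are made up, and the function names (`map_hashtags`, `reduce_by_key`, `binomial_p_value`) and the 0.5 chance baseline are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from math import comb


def map_hashtags(tweet):
    """Map step: emit one (hashtag, tweet) pair per hashtag in the tweet."""
    return [(tag.lower(), tweet) for tag in re.findall(r"#(\w+)", tweet)]


def reduce_by_key(pairs):
    """Shuffle and reduce step: collect all tweets emitted under each hashtag."""
    clusters = defaultdict(list)
    for key, value in pairs:
        clusters[key].append(value)
    return dict(clusters)


def binomial_p_value(correct, total, chance=0.5):
    """One-sided exact binomial test.

    Treats each prediction as an independent Bernoulli trial and returns
    P(X >= correct) if the classifier were only guessing with success
    probability `chance` (an assumed baseline, not taken from the paper).
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical tweets, invented for this sketch.
tweets = [
    "Loved it #KapoorAndSons #Bollywood",
    "Too long for me #KapoorAndSons",
    "Weekend plans #bollywood",
]

pairs = [pair for tweet in tweets for pair in map_hashtags(tweet)]
clusters = reduce_by_key(pairs)

for tag, members in sorted(clusters.items()):
    print(tag, len(members))

# Probability of getting at least 8 of 10 labels right by guessing alone.
print(binomial_p_value(8, 10))
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the framework would perform the grouping between them; here a dictionary stands in for that shuffle phase. A small p-value from `binomial_p_value` suggests the observed accuracy is unlikely to come from guessing alone.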

