A Novel Nodesets-Based Frequent Itemset Mining Algorithm for Big Data using MapReduce

IF 0.9 Q4 ENGINEERING, ELECTRICAL & ELECTRONIC

International Journal of Electrical and Computer Engineering Systems. Pub Date: 2023-11-14. DOI: 10.32985/ijeces.14.9.9

Borra Sivaiah, Ramisetty Rajeswara Rao


Citations: 0

Abstract



A Novel Nodesets-Based Frequent Itemset Mining Algorithm for Big Data using MapReduce

Data from different sources in organizations is growing so rapidly that traditional tools and techniques cannot handle it in a scalable fashion; data at this scale is known as big data. Similarly, many existing frequent itemset mining algorithms perform well but have scalability problems, as they cannot exploit the parallel processing power available locally or in cloud infrastructure. Since the big data and cloud ecosystem overcomes the limitations of computing resources, distributed programming paradigms such as MapReduce are a natural choice. In this paper, we propose a novel algorithm, Nodesets-based Fast and Scalable Frequent Itemset Mining (FSFIM), to extract frequent itemsets from big data. A Pre-Order Coding (POC) tree is used to represent the data and speed up processing. The Nodeset is the underlying data structure, and it is efficient for discovering frequent itemsets. Compared with its predecessors, Node-lists and N-lists, Nodesets save half of the memory because they need only pre-order or post-order coding, not both. Cloudera's Distribution of Hadoop (CDH), which provides a MapReduce framework, is used for the empirical study, and a prototype application is built to evaluate the performance of FSFIM. Experimental results show that FSFIM outperforms existing algorithms such as Mahout PFP, MLlib PFP, and BigFIM. FSFIM is found to be faster and more scalable, and an ideal candidate for real-time applications that mine frequent itemsets from big data.
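The abstract does not give the FSFIM procedure itself, so the following is only a minimal single-machine sketch of the two structures it names: a POC tree whose nodes carry a pre-order code, and Nodesets stored as lists of (pre-order code, count) pairs. All names here (Node, build_poc_tree, item_nodes, pair_nodesets, support_of) and the toy transactions are assumptions made for illustration, and the MapReduce distribution step is not shown. The support of an itemset is taken as the sum of the counts in its Nodeset.

```python
from collections import Counter, defaultdict


class Node:
    """One POC-tree node: an item, its count, and (later) a pre-order code."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> child Node, in insertion order
        self.pre = -1       # pre-order code, assigned after the tree is built


def build_poc_tree(transactions, min_support):
    """Build the tree from frequent items only, sorted by descending support."""
    support = Counter(item for t in transactions for item in set(t))
    frequent = sorted((i for i in support if support[i] >= min_support),
                      key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(frequent)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root


def item_nodes(root):
    """Assign pre-order codes and group the nodes by item, in pre-order."""
    by_item = defaultdict(list)
    code, stack = 0, [root]
    while stack:
        node = stack.pop()
        node.pre = code
        code += 1
        if node.item is not None:
            by_item[node.item].append(node)
        stack.extend(reversed(list(node.children.values())))
    return by_item


def pair_nodesets(by_item):
    """Nodeset of (ancestor_item, item): item nodes lying below that ancestor."""
    pairs = defaultdict(list)
    for item, nodes in by_item.items():
        for node in nodes:
            anc = node.parent
            while anc.item is not None:  # stop when the root is reached
                pairs[(anc.item, item)].append((node.pre, node.count))
                anc = anc.parent
    return pairs


def support_of(nodeset):
    """Support of an itemset is the sum of the counts in its Nodeset."""
    return sum(count for _, count in nodeset)


# Toy data, chosen only for illustration.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
by_item = item_nodes(build_poc_tree(transactions, min_support=2))
nodesets = {item: [(n.pre, n.count) for n in nodes] for item, nodes in by_item.items()}
pairs = pair_nodesets(by_item)
for itemset, ns in sorted(pairs.items()):
    print(itemset, ns, support_of(ns))
```

Each Nodeset entry holds a single code per node, whereas a structure that keeps both pre-order and post-order codes would store two; this is the memory saving the abstract refers to. In this sketch the ancestor test walks parent pointers during construction, which is one simple way to avoid needing the second code.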


Source journal

International Journal of Electrical and Computer Engineering Systems (Engineering, Electrical & Electronic)

CiteScore: 1.20

Self-citation rate: 11.80%


About the journal: The International Journal of Electrical and Computer Engineering Systems publishes original research in the form of full papers, case studies, reviews and surveys. It covers theory and application of electrical and computer engineering, synergy of computer systems and computational methods with electrical and electronic systems, as well as interdisciplinary research. Topics: power systems; renewable electricity production; power electronics; electrical drives; industrial electronics; communication systems; advanced modulation techniques; RFID devices and systems; signal and data processing; image processing; multimedia systems; microelectronics; instrumentation and measurement; control systems; robotics; modeling and simulation; modern computer architectures; computer networks; embedded systems; high-performance computing; engineering education; parallel and distributed computer systems; human-computer systems; intelligent systems; multi-agent and holonic systems; real-time systems; software engineering; Internet and web applications and systems; applications of computer systems in engineering and related disciplines; mathematical models of engineering systems; engineering management.