DFPS: Distributed FP-growth algorithm based on Spark

2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Pub Date: 2017-03-01, DOI: 10.1109/IAEAC.2017.8054308

Xiujin Shi, Shaozong Chen, Hui Yang

Citations: 16

Abstract

Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rule mining. As data scale grows, many efficient single-machine FIM algorithms, such as FP-growth and Apriori, can no longer finish within a reasonable time. To overcome this limitation, researchers have proposed distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. However, the heavy disk I/O incurred by each MapReduce operation makes PFP inefficient. YAFIM must generate candidate frequent itemsets in every iteration, which makes it time-consuming. Moreover, the candidate itemsets must be held in memory, and their number becomes very large on massive data, so YAFIM fails for lack of memory once the data is large enough. In this work, we propose DFPS, a distributed FP-growth algorithm based on Spark. DFPS partitions the computing tasks so that each computing node builds its conditional FP-tree and mines frequent itemsets independently using a pattern fragment growth method; no messages need to be passed between nodes during mining. Our performance study shows that DFPS outperforms YAFIM, especially when transactions are long, the number of items is large, and the data is massive, and that DFPS scales well. In the experiments, DFPS is more than 10 times faster than YAFIM on the T10I4D100K and Pumsb_star datasets.
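
The partitioning idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out how DFPS splits the work, so the code below is an assumption-laden illustration of the general pattern-growth scheme, not the authors' implementation: each frequent item gets its own conditional pattern base, and each base is mined with no data from any other base. All function names (`frequent_items`, `conditional_bases`, `grow`, `mine`) are hypothetical, plain Python stands in for Spark, and projected transaction lists are used in place of a compressed FP-tree.

```python
from collections import Counter, defaultdict


def frequent_items(transactions, min_support):
    """Return {item: count} for items that occur in at least min_support transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_support}


def conditional_bases(transactions, order):
    """Build one conditional pattern base per frequent item.

    For every frequent item in a transaction, record the frequent items that
    precede it in the global frequency order. Each base is a self-contained shard.
    """
    bases = defaultdict(list)
    for t in transactions:
        ranked = sorted((i for i in set(t) if i in order), key=order.get)
        for pos, item in enumerate(ranked):
            bases[item].append(ranked[:pos])
    return bases


def grow(suffix, base, min_support, out):
    """Pattern growth on a single conditional base; it reads no other shard."""
    out[frozenset(suffix)] = len(base)  # support of the current suffix pattern
    counts = Counter(item for prefix in base for item in prefix)
    for item, count in counts.items():
        if count >= min_support:
            # Project the base onto prefixes that contain `item`.
            projected = [p[:p.index(item)] for p in base if item in p]
            grow(suffix + [item], projected, min_support, out)


def mine(transactions, min_support):
    """Mine all frequent itemsets by processing each conditional base independently."""
    freq = frequent_items(transactions, min_support)
    ranked = sorted(freq, key=lambda i: (-freq[i], i))
    order = {item: rank for rank, item in enumerate(ranked)}
    results = {}
    # In a distributed setting each loop iteration could run on a different
    # worker (assumption); a plain loop stands in for that here.
    for item, base in conditional_bases(transactions, order).items():
        grow([item], base, min_support, results)
    return results


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
patterns = mine(transactions, min_support=3)
for itemset, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a minimum support of 3, the toy input yields the three single items (support 4 each) and the three pairs (support 3 each); the triple {a, b, c} occurs only twice and is pruned. Because every call to `grow` touches only its own base, no candidate sets are generated or shared across shards, which is the property the abstract contrasts with YAFIM.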
