{"title":"DFPS:基于Spark的分布式fp增长算法","authors":"Xiujin Shi, Shaozong Chen, Hui Yang","doi":"10.1109/IAEAC.2017.8054308","DOIUrl":null,"url":null,"abstract":"Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rules mining. With the increment of data scale, many efficient single-machine algorithms of FIM, such as FP-growth and Apriori, cannot accomplish the computing tasks within reasonable time. As a result of the limitation of single-machine methods, researchers presented some distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. Nevertheless, the heavy disk I/O cost at each MapReduce operation makes PFP not efficient enough. YAFIM needs to generate candidate frequent itemsets in each iterative step. It makes YAFIM time-consuming. And if the scale of data is large enough, YAFIM algorithm will not work due to the limitation of memory since the candidate frequent itemsets need to be stored in the memory. And the size of candidate itemsets is very large especially facing the massive data. In this work, we propose a distributed FP-growth algorithm based on Spark, we call it DFPS. DFPS partitions computing tasks in such a way that each computing node builds the conditional FP-tree and adopts a pattern fragment growth method to mine the frequent itemsets independently. DFPS doesn't need to pass messages between nodes during mining frequent itemsets. Our performance study shows that DFPS algorithm is more excellent than YAFIM, especially when the length of transactions is long, the number of items is large and the data is massive. And DFPS has an excellent scalability. The experimental results show that DFPS is more than 10 times faster than YAFIM for T10I4D100K dataset and Pumsb_star dataset.","PeriodicalId":432109,"journal":{"name":"2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":"{\"title\":\"DFPS: Distributed FP-growth algorithm based on Spark\",\"authors\":\"Xiujin Shi, Shaozong Chen, Hui Yang\",\"doi\":\"10.1109/IAEAC.2017.8054308\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rules mining. With the increment of data scale, many efficient single-machine algorithms of FIM, such as FP-growth and Apriori, cannot accomplish the computing tasks within reasonable time. As a result of the limitation of single-machine methods, researchers presented some distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. Nevertheless, the heavy disk I/O cost at each MapReduce operation makes PFP not efficient enough. YAFIM needs to generate candidate frequent itemsets in each iterative step. It makes YAFIM time-consuming. And if the scale of data is large enough, YAFIM algorithm will not work due to the limitation of memory since the candidate frequent itemsets need to be stored in the memory. And the size of candidate itemsets is very large especially facing the massive data. In this work, we propose a distributed FP-growth algorithm based on Spark, we call it DFPS. DFPS partitions computing tasks in such a way that each computing node builds the conditional FP-tree and adopts a pattern fragment growth method to mine the frequent itemsets independently. DFPS doesn't need to pass messages between nodes during mining frequent itemsets. Our performance study shows that DFPS algorithm is more excellent than YAFIM, especially when the length of transactions is long, the number of items is large and the data is massive. And DFPS has an excellent scalability. The experimental results show that DFPS is more than 10 times faster than YAFIM for T10I4D100K dataset and Pumsb_star dataset.\",\"PeriodicalId\":432109,\"journal\":{\"name\":\"2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"16\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IAEAC.2017.8054308\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IAEAC.2017.8054308","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DFPS: Distributed FP-growth algorithm based on Spark
Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rules mining. With the increment of data scale, many efficient single-machine algorithms of FIM, such as FP-growth and Apriori, cannot accomplish the computing tasks within reasonable time. As a result of the limitation of single-machine methods, researchers presented some distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. Nevertheless, the heavy disk I/O cost at each MapReduce operation makes PFP not efficient enough. YAFIM needs to generate candidate frequent itemsets in each iterative step. It makes YAFIM time-consuming. And if the scale of data is large enough, YAFIM algorithm will not work due to the limitation of memory since the candidate frequent itemsets need to be stored in the memory. And the size of candidate itemsets is very large especially facing the massive data. In this work, we propose a distributed FP-growth algorithm based on Spark, we call it DFPS. DFPS partitions computing tasks in such a way that each computing node builds the conditional FP-tree and adopts a pattern fragment growth method to mine the frequent itemsets independently. DFPS doesn't need to pass messages between nodes during mining frequent itemsets. Our performance study shows that DFPS algorithm is more excellent than YAFIM, especially when the length of transactions is long, the number of items is large and the data is massive. And DFPS has an excellent scalability. The experimental results show that DFPS is more than 10 times faster than YAFIM for T10I4D100K dataset and Pumsb_star dataset.