PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, E. Upfal
{"title":"PARMA: MapReduce中近似关联规则挖掘的并行随机化算法","authors":"Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, E. Upfal","doi":"10.1145/2396761.2396776","DOIUrl":null,"url":null,"abstract":"Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"144","resultStr":"{\"title\":\"PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce\",\"authors\":\"Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, E. Upfal\",\"doi\":\"10.1145/2396761.2396776\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. 
The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.\",\"PeriodicalId\":313414,\"journal\":{\"name\":\"Proceedings of the 21st ACM international conference on Information and knowledge management\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-10-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"144\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21st ACM international conference on Information and knowledge management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2396761.2396776\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM international conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2396761.2396776","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 144

Abstract

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
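To make the two-round structure described in the abstract concrete, below is a minimal single-process sketch in plain Python, not the actual Hadoop MapReduce implementation. The fixed sample size, the toy pairwise-only local miner, the majority-vote filter, and the mean-based frequency aggregation are illustrative assumptions standing in for the filtering and aggregation rules and the probabilistic bounds derived in the paper.

```python
# Illustrative sketch of a sampling-based parallel FIM pipeline in the spirit of PARMA.
# All thresholds and aggregation rules here are simplifications, not the paper's analysis.
import random
from itertools import combinations
from collections import Counter


def mine_sample(sample, min_freq):
    """Local mining stage ("map"): return itemsets whose frequency in the
    sample is at least min_freq (toy enumeration of single items and pairs)."""
    n = len(sample)
    counts = Counter()
    for txn in sample:
        items = sorted(set(txn))
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {iset: c / n for iset, c in counts.items() if c / n >= min_freq}


def parma_sketch(dataset, num_samples, sample_size, min_freq):
    """Round 1: each "worker" mines a small random sample whose size is fixed
    in advance and does not grow with the dataset.
    Round 2 ("reduce"): keep itemsets reported by a majority of the samples
    and aggregate their estimated frequencies (mean of local estimates)."""
    local_results = []
    for _ in range(num_samples):
        sample = random.choices(dataset, k=sample_size)  # sample with replacement
        local_results.append(mine_sample(sample, min_freq))

    votes = Counter()      # how many samples reported each itemset
    freq_sums = Counter()  # sum of local frequency estimates
    for result in local_results:
        for iset, freq in result.items():
            votes[iset] += 1
            freq_sums[iset] += freq

    required = num_samples // 2 + 1  # simple majority filter (illustrative)
    return {iset: freq_sums[iset] / n_votes
            for iset, n_votes in votes.items() if n_votes >= required}


if __name__ == "__main__":
    transactions = [["a", "b", "c"], ["a", "b"], ["b", "c"],
                    ["a", "c"], ["a", "b", "c"]] * 200
    print(parma_sketch(transactions, num_samples=8, sample_size=100, min_freq=0.5))
```

In the actual algorithm, the per-sample mining would run as independent MapReduce tasks and the aggregation step would apply the paper's accuracy and error-probability parameters to decide which itemsets to keep; the sketch only mirrors the overall map/filter/aggregate flow.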