A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

ACM Transactions on Knowledge Discovery from Data (TKDD). Pub Date: 2021-07-21. DOI: 10.1145/3465238

Guangtao Wang, G. Cong, Ying Zhang, Zhen Hai, Jieping Ye


Citation count: 1

Abstract

Streams in which multiple transactions are associated with the same key are prevalent in practice; for example, a customer may have multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging because sampling-based methods, such as the widely used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis-based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract a KMV synopsis for each item from the stream. Then, we propose a novel estimator that estimates the frequency of an itemset from the KMV synopses. Compared with the existing estimator, our method is not only more accurate and more efficient to compute but also satisfies the downward-closure property. These properties allow our new estimator to be combined with existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis-based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove that it guarantees the accuracy of FIM with KMV synopses of bounded size. Experimental results on massive streams show that our estimator significantly improves the accuracy of both itemset frequency estimation and FIM compared with existing estimators.
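The two steps described in the abstract (one KMV synopsis per item, then an estimate computed from those synopses) can be illustrated with a minimal Python sketch. It is not the estimator proposed in the article, whose details are not given here. It uses the standard KMV intersection estimator from the literature instead, and it assumes that the frequency of an itemset means the number of distinct keys whose combined transactions contain every item in the set. The names `hash01`, `KMVSynopsis`, `build_synopses`, and `estimate_frequency`, as well as the SHA-256 based hash, are hypothetical choices made for illustration.

```python
import hashlib
import heapq
from collections import defaultdict


def hash01(key: str) -> float:
    """Map a key to a pseudo-uniform value in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSynopsis:
    """Keeps the k smallest hash values of the distinct keys added so far."""

    def __init__(self, k: int):
        self.k = k
        self.values = set()  # hash values currently retained
        self._heap = []      # negated values, so -_heap[0] is the largest retained

    def add(self, key: str) -> None:
        h = hash01(key)
        if h in self.values:
            return  # same key seen again, e.g. in a later transaction
        if len(self.values) < self.k:
            self.values.add(h)
            heapq.heappush(self._heap, -h)
        elif h < -self._heap[0]:
            # Replace the current largest retained value with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self.values.discard(evicted)
            self.values.add(h)

    @property
    def saturated(self) -> bool:
        return len(self.values) >= self.k


def build_synopses(stream, k: int):
    """stream yields (key, items) pairs; one key may appear in many pairs."""
    synopses = defaultdict(lambda: KMVSynopsis(k))
    for key, items in stream:
        for item in items:
            synopses[item].add(key)
    return synopses


def estimate_frequency(synopses, itemset) -> float:
    """Standard KMV intersection estimate (assumption: not the article's estimator)."""
    if not itemset:
        raise ValueError("itemset must not be empty")
    if any(item not in synopses for item in itemset):
        return 0.0
    sketches = [synopses[item] for item in itemset]
    shared = set.intersection(*(s.values for s in sketches))
    if not any(s.saturated for s in sketches):
        # Every synopsis still holds all of its keys, so the count is exact.
        return float(len(shared))
    k = min(len(s.values) for s in sketches)
    smallest = sorted(set.union(*(s.values for s in sketches)))[:k]
    hits = sum(1 for v in smallest if v in shared)
    # (fraction of sampled keys in every synopsis) * (estimated size of the union)
    return (hits / k) * (k - 1) / smallest[-1]
```

Because each synopsis is indexed by key rather than by transaction, a customer who buys milk in one transaction and eggs in a later one is still counted for the itemset {milk, eggs}. Per-transaction sampling does not capture this directly, which is the situation the abstract describes.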

