ACM Transactions on Knowledge Discovery from Data (TKDD): Latest Articles, Page 7

Hybrid Variational Autoencoder for Recommender Systems

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-09-04 DOI: 10.1145/3470659

Hangbin Zhang, R. Wong, Victor W. Chu

Abstract: E-commerce platforms heavily rely on automatic personalized recommender systems, e.g., collaborative filtering models, to improve customer experience. Some hybrid models have been proposed recently to address the deficiency of existing models. However, their performances drop significantly when the dataset is sparse. Most of the recent works failed to fully address this shortcoming. At most, some of them only tried to alleviate the problem by considering either user side or item side content information. In this article, we propose a novel recommender model called Hybrid Variational Autoencoder (HVAE) to improve the performance on sparse datasets. Different from the existing approaches, we encode both user and item information into a latent space for semantic relevance measurement. In parallel, we utilize collaborative filtering to find the implicit factors of users and items, and combine their outputs to deliver a hybrid solution. In addition, we compare the performance of Gaussian distribution and multinomial distribution in learning the representations of the textual data. Our experiment results show that HVAE is able to significantly outperform state-of-the-art models with robust performance.

Citations: 4

Assessing Large-Scale Power Relations among Locations from Mobility Data

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-09-04 DOI: 10.1145/3470770

L. S. Oliveira, P. V. D. Melo, A. C. Viana

Abstract: The pervasiveness of smartphones has shaped our lives, social norms, and the structure that dictates human behavior. They now directly influence how individuals demand resources or interact with network services. From this scenario, identifying key locations in cities is fundamental for the investigation of human mobility and also for the understanding of social problems. In this context, we propose the first graph-based methodology in the literature to quantify the power of Points-of-Interest (POIs) over their vicinity by means of user mobility trajectories. Unlike prior work, we consider the flow of people in our analysis, instead of the number of neighbor POIs or their structural locations in the city. Thus, we model POI visits using the multiflow graph model, where each POI is a node and the transitions of users among POIs are weighted directed edges. Using this multiflow graph model, we compute the attract, support, and independence powers. The attract power and support power measure how many visits a POI gathers from and disseminates over its neighborhood, respectively. Moreover, the independence power captures the capacity of a POI to receive visitors independently from other POIs. We tested our methodology on well-known university campus mobility datasets and validated it on Location-Based Social Networks (LBSNs) datasets from various cities around the world. Our findings show that on university campuses: (i) buildings have low support power and attract power; (ii) people tend to move over a few buildings and spend most of their time in the same building; and (iii) there is a slight dependence among buildings, as even those with high independence power receive user visits from other buildings on campus. Globally, we reveal that (i) our metrics capture places that impact the number of visits in their neighborhood; (ii) cities in the same continent have similar independence patterns; and (iii) places with a high number of visits and city central areas are the regions with the highest degree of independence.

Citations: 0
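
As a rough illustration of the graph construction described in the abstract above (each POI a node, user transitions between POIs as weighted directed edges), the following Python sketch builds such a graph from toy trajectories. The function names, the trajectory format, and the plain inflow/outflow tallies are assumptions made for illustration; they are not the paper's attract, support, or independence power definitions, which the abstract does not spell out.

```python
from collections import Counter

def build_flow_graph(trajectories):
    """Count user transitions between consecutive POIs.

    trajectories: iterable of POI-id sequences, one per user (hypothetical format).
    Returns a Counter mapping (src, dst) -> number of observed transitions.
    """
    edges = Counter()
    for visits in trajectories:
        for src, dst in zip(visits, visits[1:]):
            if src != dst:  # ignore consecutive check-ins at the same POI
                edges[(src, dst)] += 1
    return edges

def flow_totals(edges):
    """Total incoming and outgoing flow per POI (plain tallies, not the paper's metrics)."""
    inflow, outflow = Counter(), Counter()
    for (src, dst), weight in edges.items():
        outflow[src] += weight
        inflow[dst] += weight
    return inflow, outflow

# Toy data: three users moving between three campus buildings.
toy_trajectories = [
    ["library", "cafe", "lab"],
    ["cafe", "lab", "lab", "library"],
    ["library", "cafe"],
]
edges = build_flow_graph(toy_trajectories)
inflow, outflow = flow_totals(edges)
```

Keeping the edge weights as transition counts, rather than counting neighboring POIs, mirrors the abstract's emphasis on the flow of people.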

Streaming Data Preprocessing via Online Tensor Recovery for Large Environmental Sensor Networks

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-09-01 DOI: 10.1145/3532189

Yue Hu, Ao Qu, Yanbing Wang, D. Work

Abstract: Measuring the built and natural environment at a fine-grained scale is now possible with low-cost urban environmental sensor networks. However, fine-grained city-scale data analysis is complicated by tedious data cleaning including removing outliers and imputing missing data. While many methods exist to automatically correct anomalies and impute missing entries, challenges still exist on data with large spatial-temporal scales and shifting patterns. To address these challenges, we propose an online robust tensor recovery (OLRTR) method to preprocess streaming high-dimensional urban environmental datasets. A small-sized dictionary that captures the underlying patterns of the data is computed and constantly updated with new data. OLRTR enables online recovery for large-scale sensor networks that provide continuous data streams, with a lower computational memory usage compared to offline batch counterparts. In addition, we formulate the objective function so that OLRTR can detect structured outliers, such as faulty readings over a long period of time. We validate OLRTR on a synthetically degraded National Oceanic and Atmospheric Administration temperature dataset, and apply it to the Array of Things city-scale sensor network in Chicago, IL, showing superior results compared with several established online and batch-based low-rank decomposition methods.

Citations: 4

Self-Supervised Transformer for Sparse and Irregularly Sampled Multivariate Clinical Time-Series

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-29 DOI: 10.1145/3516367

Sindhu Tipirneni, C. Reddy

Abstract: Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in turn suppresses the fine-grained information and adds undesirable noise/overhead into the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model, which overcomes these pitfalls by treating time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous time and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enable it to learn contextual triplet embeddings while avoiding the problems of recurrence and vanishing gradients that occur in recurrent architectures. In addition, to tackle the problem of limited availability of labeled data (which is typically observed in many healthcare applications), STraTS utilizes self-supervision by leveraging unlabeled data to learn better representations by using time-series forecasting as an auxiliary proxy task. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS has better prediction performance than state-of-the-art methods for mortality prediction, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation codes are available at https://github.com/sindhura97/STraTS.

Citations: 40
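
To make the "set of observation triplets" idea from the abstract above concrete, here is a minimal Python sketch that flattens a dense matrix with missing entries into (time, variable, value) triplets. The function name, the variable names, and the use of NaN as the missing-value marker are assumptions for illustration; this is not code from the STraTS repository.

```python
import math

def to_triplets(times, variables, matrix):
    """Flatten a dense matrix with missing entries into (time, variable, value) triplets.

    matrix[i][j] is the value of variables[j] at times[i]; NaN marks a missing reading.
    """
    triplets = []
    for t, row in zip(times, matrix):
        for name, value in zip(variables, row):
            if not math.isnan(value):  # skip missing cells instead of imputing them
                triplets.append((t, name, value))
    return triplets

nan = float("nan")
times = [0.0, 1.5, 4.0]                # irregular sampling times (hours)
variables = ["heart_rate", "lactate"]  # hypothetical variable names
matrix = [
    [82.0, nan],
    [nan, 1.9],
    [90.0, nan],
]
triplets = to_triplets(times, variables, matrix)
```

Only the three observed readings survive, so no aggregation or imputation is needed to represent the irregular sampling.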

Establishing Smartphone User Behavior Model Based on Energy Consumption Data

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3461459

M. Ding, Tianyu Wang, Xudong Wang

Abstract: In smartphone data analysis, both energy consumption modeling and user behavior mining have been explored extensively, but the relationship between energy consumption and user behavior has been rarely studied. Such a relationship is explored over large-scale users in this article. Based on energy consumption data, where each user's feature vector is represented by energy breakdown on hardware components of different apps, User Behavior Models (UBM) are established to capture user behavior patterns (i.e., app preference, usage time). The challenge lies in the high diversity of user behaviors (i.e., a very large number of apps and ways of using them), which leads to high dimension and dispersion of data. To overcome the challenge, three mechanisms are designed. First, to reduce the dimension, apps are ranked with the top ones identified as typical apps to represent all. Second, the dispersion is reduced by scaling each user's feature vector with typical apps to unit ℓ1 norm. The scaled vector becomes Usage Pattern, while the ℓ1 norm of the vector before scaling is treated as Usage Intensity. Third, the usage pattern is analyzed with a two-layer clustering approach to further reduce data dispersion. In the upper layer, each typical app is studied across its users with respect to hardware components to identify Typical Hardware Usage Patterns (THUP). In the lower layer, users are studied with respect to these THUPs to identify Typical App Usage Patterns (TAUP). The analytical results of these two layers are consolidated into Usage Pattern Models (UPM), and UBMs are finally established by a union of UPMs and Usage Intensity Distributions (UID). By carrying out experiments on energy consumption data from 18,308 distinct users over 10 days, 33 UBMs are extracted from training data. With the test data, it is proven that these UBMs cover 94% of user behaviors and achieve up to 20% improvement in accuracy of energy representation, as compared with the baseline method, PCA. Besides, potential applications and implications of these UBMs are illustrated for smartphone manufacturers, app developers, network providers, and so on.

Citations: 4
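
The second mechanism in the abstract above (scaling a feature vector to unit ℓ1 norm and keeping the original norm as a separate quantity) can be sketched in a few lines of Python. The function name and the toy energy figures are assumptions for illustration, and the vectors here are simplified to one number per app rather than the per-hardware-component breakdown the paper uses.

```python
def split_pattern_and_intensity(energy):
    """Scale a non-negative energy vector to unit l1 norm.

    Returns (usage_pattern, usage_intensity): the scaled vector and the
    l1 norm of the vector before scaling.
    """
    intensity = sum(abs(x) for x in energy)
    if intensity == 0:
        return [0.0] * len(energy), 0.0  # no usage recorded: pattern left as zeros
    return [x / intensity for x in energy], intensity

# Hypothetical energy breakdown over three typical apps for two users.
light_user = [10.0, 30.0, 60.0]
heavy_user = [100.0, 300.0, 600.0]

light_pattern, light_intensity = split_pattern_and_intensity(light_user)
heavy_pattern, heavy_intensity = split_pattern_and_intensity(heavy_user)
```

The two toy users end up with the same usage pattern and differ only in intensity, which is the kind of dispersion the scaling step is meant to remove before clustering.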

Opinion Dynamics Optimization by Varying Susceptibility to Persuasion via Non-Convex Local Search

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3466617

Rediet Abebe, T-H. Hubert Chan, J. Kleinberg, Zhibin Liang, D. Parkes, Mauro Sozio, Charalampos E. Tsourakakis

Abstract: A long line of work in social psychology has studied variations in people’s susceptibility to persuasion, that is, the extent to which they are willing to modify their opinions on a topic. This body of literature suggests an interesting perspective on theoretical models of opinion formation by interacting parties in a network: in addition to considering interventions that directly modify people’s intrinsic opinions, it is also natural to consider interventions that modify people’s susceptibility to persuasion. In this work, motivated by this fact, we propose an influence optimization problem. Specifically, we adopt a popular model for social opinion dynamics, where each agent has some fixed innate opinion, and a resistance that measures the importance it places on its innate opinion; agents influence one another’s opinions through an iterative process. Under certain conditions, this iterative process converges to some equilibrium opinion vector. For the unbudgeted variant of the problem, the goal is to modify the resistance of any number of agents (within some given range) such that the sum of the equilibrium opinions is minimized; for the budgeted variant, in addition the algorithm is given upfront a restriction on the number of agents whose resistance may be modified. We prove that the objective function is in general non-convex. Hence, formulating the problem as a convex program as in an early version of this work (Abebe et al., KDD’18) might have potential correctness issues. We instead analyze the structure of the objective function, and show that any local optimum is also a global optimum, which is somewhat surprising as the objective function might not be convex. Furthermore, we combine the iterative process and the local search paradigm to design very efficient algorithms that can solve the unbudgeted variant of the problem optimally on large-scale graphs containing millions of nodes. Finally, we propose and evaluate experimentally a family of heuristics for the budgeted variant of the problem.

Citations: 11
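
The iterative process in the abstract above can be illustrated with a small Python sketch. The abstract does not give the update rule, so the form used here, z_i <- a_i * s_i + (1 - a_i) * sum_j P[i][j] * z_j with a row-stochastic matrix P, is an assumption (a Friedkin-Johnsen-style update), as are the function name and the toy numbers. The sketch only computes the equilibrium for fixed resistances; it does not implement the paper's local search.

```python
def equilibrium_opinions(innate, resistance, weights, tol=1e-12, max_iter=10_000):
    """Iterate z_i <- a_i * s_i + (1 - a_i) * sum_j P[i][j] * z_j until it settles.

    innate: innate opinions s_i in [0, 1]
    resistance: resistance a_i in (0, 1]; higher means agent i sticks to s_i
    weights: row-stochastic interaction matrix P (each row sums to 1)
    """
    z = list(innate)
    for _ in range(max_iter):
        new_z = [
            a * s + (1 - a) * sum(p * zj for p, zj in zip(row, z))
            for s, a, row in zip(innate, resistance, weights)
        ]
        if max(abs(x - y) for x, y in zip(new_z, z)) < tol:
            return new_z
        z = new_z
    return z

innate = [0.9, 0.1, 0.5]
weights = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
baseline = equilibrium_opinions(innate, [0.5, 0.5, 0.5], weights)
# Making the agent with the lowest innate opinion fully resistant pulls the others down.
intervened = equilibrium_opinions(innate, [0.5, 1.0, 0.5], weights)
```

In this toy network the sum of equilibrium opinions drops once the second agent's resistance is raised, which is the quantity the optimization problem in the abstract seeks to minimize.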

Dynamically Adjusting Diversity in Ensembles for the Classification of Data Streams with Concept Drift

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3466616

Juan Isidro González Hidalgo, S. G. T. C. Santos, Roberto S. M. Barros

Abstract: A data stream can be defined as a system that continually generates a lot of data over time. Today, processing data streams poses new demands and challenging tasks in the data mining and machine learning areas. Concept Drift is a problem commonly characterized as changes in the distribution of the data within a data stream. The implementation of new methods for dealing with data streams where concept drifts occur requires algorithms that can adapt to several scenarios to improve their performance in the different experimental situations where they are tested. This research proposes a strategy for dynamic parameter adjustment in the presence of concept drifts. Parameter Estimation Procedure (PEP) is a general method proposed for dynamically adjusting parameters which is applied to the diversity parameter (λ) of several classification ensembles commonly used in the area. To this end, the proposed estimation method (PEP) was used to create Boosting-like Online Learning Ensemble with Parameter Estimation (BOLE-PE), Online AdaBoost-based M1 with Parameter Estimation (OABM1-PE), and Oza and Russell’s Online Bagging with Parameter Estimation (OzaBag-PE), based on the existing ensembles BOLE, OABM1, and OzaBag, respectively. To validate them, experiments were performed with artificial and real-world datasets using Hoeffding Tree (HT) as base classifier. The accuracy results were statistically evaluated using a variation of the Friedman test and the Nemenyi post-hoc test. The experimental results showed that the application of the dynamic estimation in the diversity parameter (λ) produced good results in most scenarios, i.e., the modified methods have improved accuracy in the experiments with both artificial and real-world datasets.

Citations: 2
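
For readers unfamiliar with the role of λ in the ensembles named in the abstract above, the following Python sketch shows Oza and Russell's Online Bagging in its basic form: each base learner sees every incoming example k times, with k drawn from a Poisson(λ) distribution. It is assumed here that the abstract's diversity parameter corresponds to this Poisson rate. The trivial majority-class base learner stands in for a Hoeffding Tree, all class and function names are hypothetical, and the paper's Parameter Estimation Procedure for adjusting λ is not implemented.

```python
import math
import random
from collections import Counter

class MajorityClass:
    """Trivial base learner (stand-in for a Hoeffding Tree): predicts the most seen label."""
    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

def poisson(lam, rng):
    """Sample from Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class OnlineBagging:
    def __init__(self, n_learners, lam=1.0, seed=0):
        self.learners = [MajorityClass() for _ in range(n_learners)]
        self.lam = lam  # Poisson rate; a fixed value here, adjusted dynamically in the paper
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for learner in self.learners:
            # each learner trains on the example a Poisson(lam)-distributed number of times
            for _ in range(poisson(self.lam, self.rng)):
                learner.learn(x, y)

    def predict(self, x):
        votes = Counter(learner.predict(x) for learner in self.learners)
        votes.pop(None, None)  # learners that have seen no data abstain
        return votes.most_common(1)[0][0] if votes else None

ensemble = OnlineBagging(n_learners=5, lam=1.0, seed=42)
for _ in range(50):
    ensemble.learn(x=None, y="normal")
for _ in range(5):
    ensemble.learn(x=None, y="drift")
prediction = ensemble.predict(None)
```

Because `lam` is an ordinary attribute, a drift-aware procedure could overwrite it between examples, which is the hook a dynamic adjustment scheme would need.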

High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3450527

K. O'Hare, Anna Jurek-Loughrey, Cassio P. de Campos

Abstract: Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.

Citations: 3
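
The general idea of token blocking with TF-IDF, which the abstract above builds on, can be sketched as follows in Python: score every token of a record by TF-IDF, keep the highest-scoring tokens as blocking keys, and compare only records that share a key. This is a generic sketch under assumptions (whitespace tokenization, a `top_k` cutoff, the function name); the abstract does not describe how HVTB itself selects tokens, so this should not be read as the authors' method.

```python
import math
from collections import Counter, defaultdict

def high_value_token_blocks(records, top_k=2):
    """Group record ids by their top_k highest TF-IDF tokens (a generic sketch)."""
    tokenized = [r.lower().split() for r in records]
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(records)
    blocks = defaultdict(set)
    for rid, toks in enumerate(tokenized):
        tf = Counter(toks)
        scores = {t: (c / len(toks)) * math.log(n / doc_freq[t]) for t, c in tf.items()}
        # sort by score descending, then alphabetically so ties break deterministically
        best = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        for tok in best:
            blocks[tok].add(rid)
    # a block with a single record generates no candidate pairs, so drop it
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}

records = [
    "acme anvil company ltd",
    "acme anvil co ltd",
    "globex widgets ltd",
    "globex widgets incorporated",
]
blocks = high_value_token_blocks(records)
```

On this toy input the common token "ltd" never becomes a key, so the two "acme" records and the two "globex" records land in separate blocks and only two candidate pairs are compared instead of six.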

Constrained Dual-Level Bandit for Personalized Impression Regulation in Online Ranking Systems

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3461340

Zhao Li, Junshuai Song, Zehong Hu, Zhen Wang, Jun Gao

Abstract: Impression regulation plays an important role in various online ranking systems, e.g., e-commerce ranking systems always need to achieve local commercial demands on some pre-labeled target items like fresh item cultivation and fraudulent item counteracting while maximizing their global revenue. However, local impression regulation may cause “butterfly effects” on the global scale, e.g., in e-commerce, the price preference fluctuation in initial conditions (overpriced or underpriced items) may create a significantly different outcome, thus affecting shopping experience and bringing economic losses to platforms. To prevent “butterfly effects”, some researchers define their regulation objectives with global constraints, by using contextual bandit at the page-level that requires all items on one page to share the same regulation action, which fails to conduct impression regulation on individual items. To address this problem, in this article, we propose a personalized impression regulation method that can directly make regulation decisions for each user-item pair. Specifically, we model the regulation problem as a Constrained Dual-level Bandit (CDB) problem, where the local regulation action and reward signals are at the item-level while the global effect constraint on the platform impression can be calculated at the page-level only. To handle the asynchronous signals, we first expand the page-level constraint to the item-level and then derive the policy updating as a second-order cone optimization problem. Our CDB approaches the optimal policy by iteratively solving the optimization problem. Experiments are performed on both offline and online datasets, and the results, theoretically and empirically, demonstrate that CDB outperforms state-of-the-art algorithms.

Citations: 3

A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-21 DOI: 10.1145/3465238

Guangtao Wang, G. Cong, Ying Zhang, Zhen Hai, Jieping Ye

Abstract: The streams where multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging since sampling based methods, such as the popularly used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopses for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared to the existing estimator, our method is not only more accurate and efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator with existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show our estimator can significantly improve on the accuracy for both estimating itemset frequency and FIM compared to the existing estimators.

Citations: 1
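
As background for the KMV synopsis mentioned in the abstract above, the Python sketch below shows the classic k-Minimum Value idea for a single item: hash each key (for example, a customer id) to a pseudo-uniform number in [0, 1), keep only the k smallest distinct hash values, and estimate the number of distinct keys as (k - 1) divided by the k-th smallest value. The function names and the SHA-1 based hash are assumptions for illustration. The paper's new itemset estimator is not reproduced here; this is only the standard distinct-count estimator.

```python
import hashlib
import heapq

def unit_hash(key):
    """Map a key deterministically to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_synopsis(keys, k):
    """Stream over keys, keeping the k smallest distinct hash values."""
    heap, members = [], set()  # max-heap emulated with negated values
    for key in keys:
        h = unit_hash(key)
        if h in members:
            continue  # repeated key (e.g. another transaction of the same customer)
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)  # drop the current largest value
            members.discard(evicted)
            members.add(h)
    return sorted(members)

def estimate_distinct(synopsis, k):
    """Classic KMV estimate of the number of distinct keys behind a synopsis."""
    if len(synopsis) < k:
        return float(len(synopsis))  # fewer than k distinct keys: the count is exact
    return (k - 1) / synopsis[k - 1]

# Toy multi-transaction stream: 20,000 records from 5,000 distinct customers.
stream = [i % 5000 for i in range(20000)]
synopsis = kmv_synopsis(stream, k=256)
estimate = estimate_distinct(synopsis, k=256)
```

Because the synopsis depends only on the set of distinct keys, repeated transactions from the same customer do not inflate the estimate, which is why hash-based synopses suit streams where one key carries many transactions.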