General Temporally Biased Sampling Schemes for Online Model Management

Brian Hentschel, P. Haas, Yuanyuan Tian
{"title":"General Temporally Biased Sampling Schemes for Online Model Management","authors":"Brian Hentschel, P. Haas, Yuanyuan Tian","doi":"10.1145/3360903","DOIUrl":null,"url":null,"abstract":"To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.","PeriodicalId":6983,"journal":{"name":"ACM Transactions on Database Systems (TODS)","volume":"76 1","pages":"1 - 45"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems (TODS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3360903","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
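
To make the sampling idea concrete, the following is a minimal Python sketch of a T-TBS-style sampler under strong simplifying assumptions: exponential decay at a fixed rate, batches arriving with a known constant mean size, and an independent Bernoulli retention trial for every sampled item at each time step. The function name t_tbs_sketch and its parameters are illustrative only, not the paper's API, and the sketch omits everything that distinguishes R-TBS (fractional samples, the guaranteed sample-size bound, and support for unknown, time-varying arrival rates).

import math
import random

def t_tbs_sketch(batches, target_size, decay_rate, mean_batch_size):
    """Illustrative targeted-size time-biased sampler (exponential decay).

    Each retained item survives a time step with probability
    q = exp(-decay_rate), so its inclusion probability decays
    geometrically with its age. Arriving items are admitted with
    probability p solved from the steady-state size equation
    S = q*S + p*b, i.e. p = target_size * (1 - q) / mean_batch_size;
    the sample size therefore fluctuates around target_size rather
    than being bounded, which is the weakness R-TBS is designed to fix.
    """
    q = math.exp(-decay_rate)
    p = min(1.0, target_size * (1.0 - q) / mean_batch_size)
    sample = []
    for batch in batches:
        # Age step: each sampled item survives with probability q.
        sample = [item for item in sample if random.random() < q]
        # Admission step: accept each new arrival with probability p.
        sample.extend(item for item in batch if random.random() < p)
        yield list(sample)  # snapshot on which a model could be retrained

# Toy run: 1,000 batches of 100 items, target sample size 200,
# decay rate 0.05 per step. After burn-in, the average sample size
# settles near the 200-item target.
stream = ([(t, i) for i in range(100)] for t in range(1000))
sizes = [len(s) for s in t_tbs_sketch(stream, 200, 0.05, 100)]
print(sum(sizes[500:]) / len(sizes[500:]))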