A performance study of the chain sampling algorithm

2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS). Pub Date: 2015-12-01. DOI: 10.1109/INTELCIS.2015.7397265

Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar


Citations: 7

Abstract

Online data stream analysis is an important challenge today because streams issued from multiple heterogeneous sources arrive at ever-increasing rates in many application domains. To reduce the volume of stream data, the data stream research community has designed several sampling methods. In this paper we focus on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select at random, at any time, a fixed proportion of the most recent items of the stream, that is, the items contained in the current sliding window. The algorithm is well suited to the stream context because it performs only one pass over the data. Moreover, it uses little memory, as it does not store all the items of the current sliding window. We show that the chain sampling algorithm suffers from collision (redundancy) problems: a collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first is called “inverting the selection for a high sampling rate”, and the second is inspired by the divide-and-conquer strategy. Several experiments are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.
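To make the collision problem concrete, the following is a minimal Python sketch of basic chain sampling over a sliding window, following the general scheme attributed to Babcock et al. It is an illustration under stated assumptions, not the authors' implementation: the names `ChainSampler` and `sample_window` are hypothetical, and obtaining k samples by running k independent chains is an assumption made here so that collisions become visible.

```python
import random
from collections import deque


class ChainSampler:
    """One chain: keeps a single random sample from the last `window` items."""

    def __init__(self, window, rng=None):
        self.window = window
        self.rng = rng or random.Random()
        self.count = 0          # 1-based index of the most recent item
        self.chain = deque()    # (index, value) pairs; chain[0] is the sample
        self.next_index = None  # index of the item that will extend the chain

    def _pick_successor(self, i):
        # The successor is drawn uniformly from the `window` items after i.
        self.next_index = self.rng.randint(i + 1, i + self.window)

    def add(self, value):
        self.count += 1
        i = self.count
        # Drop the head of the chain once it has left the window.
        while self.chain and self.chain[0][0] <= i - self.window:
            self.chain.popleft()
        if self.rng.random() < 1.0 / min(i, self.window):
            # The new item becomes the sample; the old chain is discarded.
            self.chain.clear()
            self.chain.append((i, value))
            self._pick_successor(i)
        elif i == self.next_index:
            # The awaited successor has arrived: append it to the chain.
            self.chain.append((i, value))
            self._pick_successor(i)

    def sample(self):
        return self.chain[0] if self.chain else None


def sample_window(stream, window, k, seed=0):
    """Run k independent chains (an assumption of this sketch) in one pass."""
    rng = random.Random(seed)
    chains = [ChainSampler(window, rng) for _ in range(k)]
    for item in stream:
        for chain in chains:
            chain.add(item)
    return [chain.sample() for chain in chains]
```

For example, `sample_window(range(1000), window=10, k=20)` returns 20 `(index, value)` pairs, all drawn from the last 10 items. Since only 10 distinct items are available, some item must appear more than once, which is the kind of redundancy the abstract calls a collision. Each chain stores only the current sample and its pending successors, not the whole window.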
