A performance study of the chain sampling algorithm

Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar
{"title":"A performance study of the chain sampling algorithm","authors":"Rayane El Sibai, Yousra Chabchoub, J. Demerjian, Zakia Kazi-Aoul, Kabalan Barbar","doi":"10.1109/INTELCIS.2015.7397265","DOIUrl":null,"url":null,"abstract":"On-line data stream analysis is an important challenge today because of the always-increasing rates of the streams issued from multiple heterogeneous sources, in many application domains. To reduce the amount of the data stream, several sampling methods were designed by the data stream research community. We focus in this paper, on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select randomly and at any time, a given fixed proportion from the most recent items of the stream contained in the last sliding window. This algorithm is well adapted to the stream context, as only one pass over the data is performed. Moreover it uses a small memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from some collision or redundancy problems. The collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first one is called “inverting the selection for a high sampling rate” and the second one is inspired from the “divide to conquer strategy”. Different experimentations are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.","PeriodicalId":6478,"journal":{"name":"2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)","volume":"91 1","pages":"487-494"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/INTELCIS.2015.7397265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

On-line data stream analysis is an important challenge today because of the always-increasing rates of the streams issued from multiple heterogeneous sources, in many application domains. To reduce the amount of the data stream, several sampling methods were designed by the data stream research community. We focus in this paper, on the chain sampling algorithm proposed by Babcock et al. The aim of this algorithm is to select randomly and at any time, a given fixed proportion from the most recent items of the stream contained in the last sliding window. This algorithm is well adapted to the stream context, as only one pass over the data is performed. Moreover it uses a small memory, as it does not store all the items of the current sliding window. We show in this paper that the chain sampling algorithm suffers from some collision or redundancy problems. The collision occurs when the same item is selected as a sample more than once during the execution of the algorithm. We propose two approaches to overcome this weakness and improve the chain sampling algorithm. The first one is called “inverting the selection for a high sampling rate” and the second one is inspired from the “divide to conquer strategy”. Different experimentations are performed to show the efficiency of these two improvements, in particular their impact on the execution time of the algorithm.
链式采样算法的性能研究
在线数据流分析是当今的一个重要挑战,因为在许多应用领域中,来自多个异构源的数据流的比率一直在增加。为了减少数据流的量,数据流研究界设计了几种采样方法。本文主要研究Babcock等人提出的链式采样算法。该算法的目的是在任何时候随机地从包含在最后一个滑动窗口中的流的最近项中选择给定的固定比例。该算法很好地适应了流上下文,因为只对数据执行一次传递。此外,它使用很小的内存,因为它不存储当前滑动窗口的所有项目。本文证明了链式采样算法存在一些碰撞或冗余问题。当在算法执行期间多次选择同一项作为样本时,就会发生冲突。我们提出了两种方法来克服这一缺点并改进链采样算法。第一种方法被称为“高采样率的反向选择”,第二种方法的灵感来自于“分而治之”策略。通过不同的实验来证明这两种改进的效率,特别是它们对算法执行时间的影响。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信