Stable Cuckoo Filter for Data Streams

Shangsen Li, Lailong Luo, Deke Guo, Yawei Zhao
Published in: 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), 2021-12-01
DOI: 10.1109/ICPADS53394.2021.00023 (https://doi.org/10.1109/ICPADS53394.2021.00023)

Abstract

Cuckoo filters (CF), Bloom filters (BF), and their variants are space-efficient probabilistic data structures for approximate set membership queries. However, their data synopses inevitably become unusable once the set undergoes many member updates, and such updates are common in real-world data streaming applications such as duplicate item detection, malicious URL checking, and caching. Some BF variants have been shown to adapt to streaming applications, but existing BF extensions generally suffer from unstable performance or intolerable membership testing errors. In this paper, we aim to design a data synopsis for membership testing on data streams that offers stable performance and tolerable query errors. To this end, we propose the Stable Cuckoo Filter (SCF), which evicts stale elements in a fine-grained manner and stores more recent ones. SCF draws on lessons from several unsuccessful designs. Specifically, its update operations embed time information during insertion and carefully evict stale elements. We show that a tight upper bound on the expected false positive rate (FPR) remains asymptotically constant as new members are inserted. The false negative rate (FNR) of SCF, i.e., its query error for recent elements, depends on the characteristics of the input data stream and the query workload. Extensive experiments on real-world and synthetic datasets show that our designs are more stable than existing BF variants, with 7× smaller false errors and up to 3× the throughput.
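The abstract does not spell out SCF's update rule, so the following is not the authors' algorithm. It is a minimal sketch, under stated assumptions, of the general idea of attaching time information to cuckoo-filter entries and evicting stale ones. The class name `AgingCuckooFilter`, the per-entry age counter, the round-robin aging of one bucket per insertion, and the replace-the-oldest policy for full buckets are all hypothetical choices made for illustration.

```python
import hashlib


class AgingCuckooFilter:
    """Illustrative only: cuckoo-style filter whose entries carry an age.

    Assumptions (not taken from the paper): 8-bit fingerprints, one bucket
    aged per insertion, and oldest-entry replacement when both candidate
    buckets are full (no cuckoo relocation chain).
    """

    def __init__(self, num_buckets=64, bucket_size=4, max_age=3):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_age = max_age
        self.buckets = [[] for _ in range(num_buckets)]  # entries: [fp, age]
        self.cursor = 0  # next bucket to age

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        return self._hash(b"fp" + item.encode()) % 255 + 1  # 1..255

    def _candidates(self, item: str, fp: int):
        # Partial-key cuckoo hashing: second index = first XOR hash(fp).
        i1 = self._hash(item.encode()) % self.num_buckets
        i2 = (i1 ^ self._hash(bytes([fp]))) % self.num_buckets
        return i1, i2

    def _age_one_bucket(self):
        # Increment ages in one bucket and drop entries older than max_age.
        bucket = self.buckets[self.cursor]
        for entry in bucket:
            entry[1] += 1
        bucket[:] = [e for e in bucket if e[1] <= self.max_age]
        self.cursor = (self.cursor + 1) % self.num_buckets

    def insert(self, item: str) -> None:
        self._age_one_bucket()
        fp = self._fingerprint(item)
        i1, i2 = self._candidates(item, fp)
        # A fingerprint that is already stored is refreshed, not duplicated.
        for i in (i1, i2):
            for entry in self.buckets[i]:
                if entry[0] == fp:
                    entry[1] = 0
                    return
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append([fp, 0])
                return
        # Both candidate buckets are full: overwrite the oldest entry.
        victim = max((i1, i2), key=lambda i: max(e[1] for e in self.buckets[i]))
        bucket = self.buckets[victim]
        oldest = max(range(len(bucket)), key=lambda k: bucket[k][1])
        bucket[oldest] = [fp, 0]

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1, i2 = self._candidates(item, fp)
        return any(e[0] == fp for i in (i1, i2) for e in self.buckets[i])


if __name__ == "__main__":
    f = AgingCuckooFilter()
    for k in range(10_000):
        f.insert(f"item-{k}")
    f.insert("fresh")
    print(f.contains("fresh"))                 # True: just inserted
    print(sum(len(b) for b in f.buckets))      # never exceeds 64 * 4 slots
```

In this sketch the occupancy stays bounded regardless of stream length, which is why the false positive rate cannot grow without limit. The price is that older items can be reported as absent, which mirrors the FPR/FNR trade-off the abstract describes.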