Stable Cuckoo Filter for Data Streams
Shangsen Li, Lailong Luo, Deke Guo, Yawei Zhao
2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS), published 2021-12-01
DOI: 10.1109/ICPADS53394.2021.00023 (https://doi.org/10.1109/ICPADS53394.2021.00023)
Abstract
Cuckoo filters (CF), Bloom filters (BF), and their variants are space-efficient probabilistic data structures for approximate set membership queries. However, their data synopses inevitably become unusable once the set undergoes many member updates, and such updates are common in real-world data streaming applications such as duplicate item detection, malicious URL checking, and caching. Some BF variants have been shown to adapt to streaming applications, but current BF extensions generally suffer from unstable performance or intolerable membership testing errors. In this paper, we aim to design a data synopsis for membership testing on data streams with stable performance and tolerable query errors. To this end, we propose Stable Cuckoo Filters (SCF), which evict stale elements in a fine-grained manner and store more recent ones. SCF draws on the design lessons of several unsuccessful designs. Specifically, SCF uses update operations that embed time information during insertion and carefully evict stale elements. We show that a tight upper bound on the expected false positive rate (FPR) remains asymptotically constant as new members are inserted. The false negative rate (FNR) of SCF, i.e., the query error for recent elements, depends on the characteristics of the input data stream and the query workload. Extensive experiments on real-world and synthetic datasets show that our designs are more stable than existing BF variants, achieving 7× smaller false errors and up to 3× the throughput.
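The abstract does not spell out SCF's update rule, so the following is only an illustrative sketch of the general idea it describes: attach time information to each stored fingerprint at insertion and overwrite the stalest entry when there is no room. It is not the SCF algorithm from the paper. The class name `AgingCuckooFilter`, its parameters, the per-slot insertion counter, and the "overwrite the oldest candidate slot" policy are all assumptions made for illustration.

```python
import hashlib


class AgingCuckooFilter:
    """Toy cuckoo-style filter that overwrites its stalest entry when full.

    Hypothetical illustration only; not the SCF design from the paper.
    """

    def __init__(self, num_buckets=64, bucket_size=4, fp_bits=12):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_mask = (1 << fp_bits) - 1
        # Each slot is None or a [fingerprint, insertion_time] pair.
        self.buckets = [[None] * bucket_size for _ in range(num_buckets)]
        self.clock = 0  # logical time: number of insertions seen so far

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _candidates(self, item: str):
        h = self._hash(item.encode())
        fp = (h & self.fp_mask) or 1  # avoid a zero fingerprint
        i1 = (h >> 16) % self.num_buckets
        # Partial-key cuckoo hashing: second bucket from first bucket XOR hash(fp).
        i2 = (i1 ^ self._hash(fp.to_bytes(4, "big"))) % self.num_buckets
        return fp, i1, i2

    def insert(self, item: str) -> None:
        self.clock += 1
        fp, i1, i2 = self._candidates(item)
        slots = [(i, s) for i in (i1, i2) for s in range(self.bucket_size)]
        # 1) If the fingerprint is already stored, just refresh its timestamp.
        for i, s in slots:
            entry = self.buckets[i][s]
            if entry is not None and entry[0] == fp:
                entry[1] = self.clock
                return
        # 2) Otherwise take any empty slot in the two candidate buckets.
        for i, s in slots:
            if self.buckets[i][s] is None:
                self.buckets[i][s] = [fp, self.clock]
                return
        # 3) Both buckets are full: overwrite the entry with the oldest timestamp.
        i, s = min(slots, key=lambda p: self.buckets[p[0]][p[1]][1])
        self.buckets[i][s] = [fp, self.clock]

    def contains(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        return any(
            e is not None and e[0] == fp
            for i in (i1, i2)
            for e in self.buckets[i]
        )


if __name__ == "__main__":
    f = AgingCuckooFilter(num_buckets=8, bucket_size=2)
    for n in range(1000):
        f.insert(f"item-{n}")
    print(f.contains("item-999"))  # the most recent insertion is always found
```

In this sketch the table never fills up permanently, so an item that was just inserted is always reported as present, while older items may be evicted and produce false negatives. That mirrors the trade-off the abstract describes between a stable false positive rate and a workload-dependent false negative rate, without reproducing the paper's actual mechanism or bounds.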