A Critical Analysis of Classifier Selection in Learned Bloom Filters

International Conference on Engineering Applications of Neural Networks. Pub Date: 2022-11-28. DOI: 10.48550/arXiv.2211.15565

D. Malchiodi, Davide Raimondi, G. Fumagalli, R. Giancarlo, Marco Frasca



Abstract

Learned Bloom Filters, i.e., models induced from data via machine learning techniques and solving the approximate set membership problem, have recently been introduced with the aim of enhancing the performance of standard Bloom Filters, with special focus on space occupancy. Unlike in the classical case, the "complexity" of the data used to build the filter might heavily affect its performance. Therefore, we propose here what is, to the best of our knowledge, the first in-depth analysis for assessing the performance of a given Learned Bloom Filter, in conjunction with a given classifier, on a dataset of a given classification complexity. Specifically, we propose a novel methodology, supported by software, for designing, analyzing and implementing Learned Bloom Filters as a function of specific constraints on their multi-criteria nature (that is, constraints involving space efficiency, false positive rate, and reject time). Our experiments show that the proposed methodology and the supporting software are valid and useful: we find that only two classifiers have desirable properties in relation to problems with different data complexity and, interestingly, neither of them has been considered so far in the literature. We also show experimentally that the Sandwiched variant of Learned Bloom Filters is the most robust to data complexity and classifier performance variability, and that it usually has smaller reject times. The software can be readily used to test new Learned Bloom Filter proposals, which can be compared with the best ones identified here.
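
To make the data structure discussed in the abstract concrete, the following is a minimal sketch of the basic Learned Bloom Filter idea, not the authors' software. A classifier scores each query; queries scoring at or above a threshold are accepted, and the remaining queries are checked against a small backup Bloom filter that stores the keys the classifier misses, so no key is ever rejected. All names here (`BloomFilter`, `LearnedBloomFilter`, `toy_score`, `threshold`, `backup_fpr`) are assumptions made for illustration, and `toy_score` is a hand-written stand-in for a trained classifier. The Sandwiched variant is generally described as adding a further Bloom filter in front of the classifier; that extra stage is omitted here.

```python
import hashlib
import math


class BloomFilter:
    """Plain Bloom filter over strings, using double hashing on SHA-256."""

    def __init__(self, capacity, fpr):
        capacity = max(1, capacity)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(fpr) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class LearnedBloomFilter:
    """Classifier followed by a backup Bloom filter for the keys it misses."""

    def __init__(self, keys, score, threshold, backup_fpr=0.01):
        self.score = score
        self.threshold = threshold
        # Keys scored below the threshold would be false negatives of the
        # classifier, so they are stored in the backup filter.
        missed = [k for k in keys if score(k) < threshold]
        self.backup = BloomFilter(len(missed), backup_fpr)
        for k in missed:
            self.backup.add(k)

    def __contains__(self, key):
        if self.score(key) >= self.threshold:
            return True  # accepted by the classifier (may be a false positive)
        return key in self.backup


def toy_score(key):
    """Hypothetical stand-in for a trained classifier: favors 'user*' strings."""
    return 0.9 if key.startswith("user") else 0.1


keys = [f"user{i}" for i in range(100)] + [f"admin{i}" for i in range(10)]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5)

# Every key is reported as present: classifier misses fall through to the backup.
print(all(k in lbf for k in keys))
```

In this toy setup the ten `admin*` keys are the only ones stored in the backup filter, which is where the space saving comes from when the classifier separates keys from non-keys well. A non-key such as `"user999"` is accepted by the classifier and is therefore a false positive, which illustrates why classifier quality and data complexity bear on the filter's false positive rate.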
