Performance evaluation of bloom filter size in map-side and reduce-side bloom joins

2017 8th International Conference on Information and Communication Systems (ICICS). Pub Date: 2017-04-01. DOI: 10.1109/IACS.2017.7921965

A. Al-Badarneh, Hassan M. Najadat, Salah Rababah


Citations: 3

Abstract

MapReduce (MR) is an efficient programming model for processing big data. However, MR has some limitations in performing the join operation. Recent research, such as the Bloom join, has sought to alleviate this problem. The idea of the Bloom join is to construct a Bloom filter that removes redundant records before the join operation is performed. The size of the constructed filter is critical and should be chosen carefully. In this paper, we evaluate the effect of the Bloom filter size on the performance of two Bloom join algorithms, Map-side Bloom join and Reduce-side Bloom join. In our methodology, we constructed multiple Bloom filters of different sizes for two static input datasets. Our experimental results show that neither a small nor a large filter is always the best choice for good performance; the filter size should be chosen based on the size of the input datasets. The results also show that tuning the Bloom filter size has a major effect on join performance. Furthermore, when memory is a concern, the results suggest choosing a small Bloom filter size for both algorithms, provided the false positive rate it produces remains negligible. On the other hand, small to medium Bloom filter sizes in the Reduce-side join produce a smaller elapsed time than in the Map-side join, while large sizes produce a larger elapsed time.
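The filtering idea described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the authors' MapReduce implementation: the class `BloomFilter`, the functions `bloom_join` and `expected_fp_rate`, the double-hashing scheme, the choice of three hash functions, and the toy datasets are all assumptions made for illustration. The sketch builds a filter on the join keys of the smaller dataset, drops records of the larger dataset whose keys the filter rejects, and then joins the survivors. It also evaluates the standard approximation of the false positive rate, (1 - e^(-kn/m))^k for m bits, k hash functions, and n inserted keys, to show why the filter size matters.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one SHA-256 digest.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def expected_fp_rate(num_bits, num_hashes, num_keys):
    """Standard approximation of the false positive rate: (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def bloom_join(small, large, num_bits, num_hashes=3):
    """Filter `large` with a Bloom filter built on the keys of `small`, then hash-join.

    Both inputs are lists of (key, value) pairs. Returns the joined rows and the
    number of `large` records that passed the filter.
    """
    bf = BloomFilter(num_bits, num_hashes)
    for key, _ in small:
        bf.add(key)

    # Records rejected by the filter cannot match, so they are dropped before the join.
    survivors = [(key, value) for key, value in large if key in bf]

    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    joined = [(key, sv, lv) for key, lv in survivors for sv in lookup.get(key, [])]
    return joined, len(survivors)


# Hypothetical toy datasets: keys 50..99 appear in both inputs.
small = [(i, f"s{i}") for i in range(100)]
large = [(i, f"l{i}") for i in range(50, 5000)]

for num_bits in (64, 1024, 16384):
    joined, passed = bloom_join(small, large, num_bits)
    print(num_bits, passed, len(joined), expected_fp_rate(num_bits, 3, len(small)))
```

Because a Bloom filter never produces false negatives, the join result is the same 50 rows for every filter size. What changes is how many non-matching records slip through as false positives and still have to be processed by the join, which is the cost the paper trades against the memory taken by the filter.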

