Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution

Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. Pub Date: 2020-11-30. DOI: 10.1145/3428757.3429140

Xiao Chen, Nishanth Entoor Venkatarathnam, Kirity Rapuru, David Broneske, Gabriel Campero Durand, Roman Zoun, G. Saake


Citations: 0


Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution

Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. In recent years, in the face of ever-increasing data volumes, both blocking techniques and parallel computation have been proposed for ER to reduce its running time and improve efficiency. The MapReduce programming model is a popular and convenient choice for parallel computation. With the default load balancing strategy, however, skewed block sizes lead to an imbalanced reducer load and significantly increase the runtime. One possible solution is block splitting: breaking overpopulated blocks into smaller sub-blocks to improve efficiency. In this paper, we analyze the advantages and disadvantages of the state-of-the-art block-splitting methods (BlockSplit and BlockSlicer), and we propose two approaches, TLS and BOS, to overcome the identified drawbacks. We comprehensively evaluate and compare these solutions, using Spark implementations, on real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When the memory of the cluster is not abundant for a given dataset, a high number of reducers is required to reduce the garbage collection (GC) time and improve efficiency. In particular, our TLS and BOS have overwhelmingly lower overhead owing to their block-wise composite key assignment.
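The abstract does not describe how the methods work internally, so the following is only a generic illustration of the two ideas it names: splitting an overpopulated block into sub-blocks, and greedily assigning the resulting comparison tasks to reducers. It is not BlockSplit, BlockSlicer, TLS, or BOS as defined in the paper. The function names, the `max_sub_size` parameter, and the largest-task-first heuristic are all assumptions made for this sketch, and it runs in plain Python rather than Spark.

```python
import heapq
from itertools import combinations


def pair_count(n):
    """Number of pairwise comparisons inside a block of n records."""
    return n * (n - 1) // 2


def split_block(block_id, size, max_sub_size):
    """Split one block into comparison tasks (hypothetical helper).

    Blocks of at most max_sub_size records stay whole. Larger blocks are
    cut into k near-equal sub-blocks; each sub-block is compared with
    itself and every pair of sub-blocks is compared with each other, so
    the total number of comparisons is unchanged.
    """
    if size <= max_sub_size:
        return [((block_id,), pair_count(size))]
    k = -(-size // max_sub_size)  # ceiling division
    base, extra = divmod(size, k)
    sub_sizes = [base + (1 if i < extra else 0) for i in range(k)]
    tasks = [((block_id, i), pair_count(s)) for i, s in enumerate(sub_sizes)]
    tasks += [
        ((block_id, i, j), sub_sizes[i] * sub_sizes[j])
        for i, j in combinations(range(k), 2)
    ]
    return tasks


def greedy_assign(tasks, num_reducers):
    """Assign tasks, largest first, to the currently least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for task_id, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[task_id] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    loads = [0] * num_reducers
    for load, reducer in heap:
        loads[reducer] = load
    return assignment, loads


if __name__ == "__main__":
    # Made-up, skewed block sizes: one block dominates the workload.
    block_sizes = {"a": 100, "b": 10, "c": 10, "d": 10}

    whole = [t for b, n in block_sizes.items() for t in split_block(b, n, 10**9)]
    split = [t for b, n in block_sizes.items() for t in split_block(b, n, 25)]

    _, loads_whole = greedy_assign(whole, 4)
    _, loads_split = greedy_assign(split, 4)
    print("max reducer load without splitting:", max(loads_whole))
    print("max reducer load with splitting:   ", max(loads_split))
```

In this toy setting the unsplit block of 100 records forces one reducer to perform all of its comparisons, while splitting produces smaller tasks that the greedy step can spread across reducers. The total number of comparisons is the same in both cases; only its distribution changes.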
