Method for accelerating the joining of distributed datasets by a given criterion
Yevgeniya Tyryshkina, S. Tumkovskiy
Informatsionno-Upravliaiushchie Sistemy, 28 October 2022. DOI: 10.31799/1684-8853-2022-5-2-11
Introduction: rapidly growing volumes of information pose new challenges to modern data analysis technologies. For reasons of cost and performance, data processing is now usually performed on cluster systems, and one of the most common operations in analytics is joining datasets. A join is an extremely expensive operation that is difficult to scale and accelerate in distributed databases and in systems based on the MapReduce paradigm. Although much effort has gone into improving the performance of this operation, the proposed methods often either require fundamental changes to the MapReduce framework or aim only at reducing its overheads, for example by balancing the network load. Objective: to develop an algorithm that accelerates the joining of datasets in distributed systems. Results: the Apache Spark architecture and the features of MapReduce-based distributed computing are reviewed, typical dataset join methods are analyzed, the main recommendations for optimizing join operations are given, and an algorithm is proposed that speeds up a special case of the join as implemented in Apache Spark. The algorithm partitions the datasets and partially transfers them to the compute nodes of the cluster so as to combine the advantages of the merge and broadcast joins. The experimental data show that the method becomes more effective as the input volume grows: for 2 TB of compressed data, a speedup of up to ~37% was obtained over standard Spark SQL.
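The two strategies the algorithm combines can be illustrated with a minimal pure-Python sketch (hypothetical function names; this is not the authors' implementation or Spark's API). A merge join sorts both sides by the key and merges them in one pass, while a broadcast join replicates the small side to every partition of the large side, avoiding a shuffle of the large dataset:

```python
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by the hash of its join key,
    as a shuffle in a distributed engine would."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def merge_join(left, right, key):
    """Sort-merge join: both sides sorted by key, then merged in one pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for row in left:
        while j < len(right) and right[j][key] < row[key]:
            j += 1                       # skip right keys below the current left key
        k = j
        while k < len(right) and right[k][key] == row[key]:
            out.append({**row, **right[k]})
            k += 1
    return out

def broadcast_join(large_parts, small_rows, key):
    """Broadcast join: the small side is replicated (as a hash map) to
    every partition of the large side; each partition joins independently."""
    lookup = defaultdict(list)
    for s in small_rows:
        lookup[s[key]].append(s)
    out = []
    for part in large_parts:
        for row in part:
            for s in lookup.get(row[key], []):
                out.append({**row, **s})
    return out
```

The trade-off the paper exploits follows directly from the sketch: the broadcast join moves only the small side over the network but requires it to fit in memory on each node, whereas the merge join scales to two large sides at the cost of shuffling and sorting both.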