Author: Yevgeniya Tyryshkina
Venue: 2022 Moscow Workshop on Electronic and Networking Technologies (MWENT)
Publication date: 2022-06-09
DOI: 10.1109/MWENT55238.2022.9802185 (https://doi.org/10.1109/MWENT55238.2022.9802185)
Accelerating Join of Distributed Datasets by a Given Criterion
This article discusses the operation of joining distributed datasets by a given criterion in distributed systems. A critical analysis of the literature on the architecture of distributed data warehouses and on typical methods for joining datasets was carried out, and the limiting stages that slow down the process were identified. A method for accelerating the join operation according to a given criterion is proposed; on its basis, an algorithm was developed and implemented in the Apache Spark data processing environment. Experimental studies confirming the efficiency of the developed method were performed. The results show that the proposed method can significantly increase the speed of the operation compared to existing solutions: for 2 TB of data, the algorithm performed the join operation approximately 37% faster than the standard algorithm offered by the Spark SQL library, and for 7 TB of data the gain was approximately 47%.
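The abstract does not describe the proposed method, so the following is only a generic illustration of the kind of operation being accelerated: a join of two datasets on a key, done by first routing rows with the same key to the same partition and then joining each partition pair locally. It is a minimal single-process sketch in plain Python, not the paper's algorithm and not Spark SQL's implementation; the function names (`partition`, `hash_join_partition`, `partitioned_join`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, num_partitions):
    """Route each row to a partition by hashing its join key.

    In a real distributed system this is the shuffle step, where rows
    move between nodes; here the partitions are just in-memory lists.
    """
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts


def hash_join_partition(left, right, key):
    """Join two co-located partitions with a build-and-probe hash join."""
    table = defaultdict(list)
    for r in right:  # build phase: index the right side by join key
        table[key(r)].append(r)
    # probe phase: look up each left row and merge matching pairs
    return [{**l, **r} for l in left for r in table.get(key(l), [])]


def partitioned_join(left, right, key, num_partitions=4):
    """Inner join: partition both inputs by key, then join partition pairs."""
    left_parts = partition(left, key, num_partitions)
    right_parts = partition(right, key, num_partitions)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(hash_join_partition(lp, rp, key))
    return joined


# Hypothetical sample data: orders joined to users on "id".
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "item": "x"}, {"id": 1, "item": "y"}, {"id": 3, "item": "z"}]
result = partitioned_join(orders, users, key=lambda row: row["id"])
```

Because both inputs are partitioned with the same hash function, matching rows always land in the same partition pair, so each pair can be joined independently. In a cluster, the data movement in the partitioning step is commonly a dominant cost of such a join, which is the sort of limiting stage the abstract refers to.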