{"title":"Join Algorithms under Apache Spark: Revisited","authors":"A. Al-Badarneh","doi":"10.1145/3323933.3324094","DOIUrl":null,"url":null,"abstract":"Currently, we are dealing with large scale applications, which in turn generate massive amount of data and information. Large amount of data often requires processing algorithms using massive parallelism, where the main performance metrics is the communication cost. Apache Spark is highly scalable, fault-tolerance, and can be used across many computers. So join algorithm is one of the most widely used algorithms in database systems, but it is also a heavily time consuming operation. In this work, we will survey and criticize several implementations of Spark join algorithms and discuss their strengths and weaknesses, present a detailed comparison of these algorithms, and introduce optimization approaches to enhance and tune the performance of join algorithms.","PeriodicalId":137904,"journal":{"name":"Proceedings of the 2019 5th International Conference on Computer and Technology Applications","volume":"109 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 5th International Conference on Computer and Technology Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3323933.3324094","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Large-scale applications now generate massive amounts of data and information. Processing data at this scale often requires massively parallel algorithms, for which the main performance metric is communication cost. Apache Spark is highly scalable and fault-tolerant, and it can run across many computers. The join is one of the most widely used operations in database systems, but it is also among the most time-consuming. In this work, we survey and critique several implementations of Spark join algorithms, discuss their strengths and weaknesses, present a detailed comparison of these algorithms, and introduce optimization approaches to enhance and tune join performance.