Xiaopeng Fan, Xinchun Liu, Yang Wang, Youjun Wang, Jing Li
{"title":"Optimizing Multi-way Theta Join for Data Skew in Sub-second Stream Computing","authors":"Xiaopeng Fan, Xinchun Liu, Yang Wang, Youjun Wang, Jing Li","doi":"10.1109/ICPADS51040.2020.00068","DOIUrl":null,"url":null,"abstract":"In sub-second stream computing, the answer to a complex query usually depends on operations of aggregation or join on streams, especially multi-way theta join. Some attribute keys are not distributed uniformly, which is called the data intrinsic skew problem, such as taxi car plate in GPS trajectories and transaction records, or stock code in stock quotes and investment portfolios etc. In this paper, we define the concept of key redundancy for single stream as the degree of data intrinsic skew, and joint key redundancy for multi-way streams. We present an execution model for multi-way stream theta joins with a fine-grained cost model to evaluate its performance. We propose a solution named Group Join (GroJoin) to make use of key redundancy during transmission and execution in a cluster. GroJoin is adaptive to data intrinsic skew in the way that it depends on the grouping condition we find out, i.e., the selectivity of theta join results should be smaller than 25%. Experiments are carried out by our MS-Generator to produce multi-way streams, and the simulation results show that GroJoin can decrease at most 45% transmission overheads with different key redundancies and value-key proportionality coefficients, and reduce at most 70% query delay with different key distributions. We further implement GroJoin in Multi-Way Stream Theta Join by Spark Streaming. The experimental results demonstrate that there are about 40%∼50% join latency reduced after our optimization with a very small computation cost.","PeriodicalId":196548,"journal":{"name":"2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)","volume":"14611 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS51040.2020.00068","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In sub-second stream computing, the answer to a complex query usually depends on operations of aggregation or join on streams, especially multi-way theta join. Some attribute keys are not distributed uniformly, which is called the data intrinsic skew problem, such as taxi car plate in GPS trajectories and transaction records, or stock code in stock quotes and investment portfolios etc. In this paper, we define the concept of key redundancy for single stream as the degree of data intrinsic skew, and joint key redundancy for multi-way streams. We present an execution model for multi-way stream theta joins with a fine-grained cost model to evaluate its performance. We propose a solution named Group Join (GroJoin) to make use of key redundancy during transmission and execution in a cluster. GroJoin is adaptive to data intrinsic skew in the way that it depends on the grouping condition we find out, i.e., the selectivity of theta join results should be smaller than 25%. Experiments are carried out by our MS-Generator to produce multi-way streams, and the simulation results show that GroJoin can decrease at most 45% transmission overheads with different key redundancies and value-key proportionality coefficients, and reduce at most 70% query delay with different key distributions. We further implement GroJoin in Multi-Way Stream Theta Join by Spark Streaming. The experimental results demonstrate that there are about 40%∼50% join latency reduced after our optimization with a very small computation cost.