Yang Cao, W. Fan, Wenzhi Fu, Ruochun Jin, Weijie Ou, Wenliang Yi
{"title":"Extracting Graphs Properties with Semantic Joins","authors":"Yang Cao, W. Fan, Wenzhi Fu, Ruochun Jin, Weijie Ou, Wenliang Yi","doi":"10.1109/ICDE55515.2023.00175","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00175","url":null,"abstract":"This paper proposes an approach to querying a relational database $\\mathcal{D}$ and a graph G taken together in SQL. We introduce a semantic extension of joins across $\\mathcal{D}$ and G such that if a tuple t in $\\mathcal{D}$ and a vertex v in G refer to the same real-world entity, then we join t and v to correlate their information and complement tuple t with additional properties of vertex v from the graph. Moreover, we extract hidden relationships between t and other entities by exploring paths from v. To support the semantic joins, we develop an extraction scheme based on LSTM, path clustering and ranking, to fetch important properties from graphs, and incrementally maintain the extracted data in response to updates. We also provide methods for implementing static joins when t is a tuple in $\\mathcal{D}$, dynamic joins when t comes from the intermediate result of a sub-query, and heuristic joins to strike a balance between the complexity and accuracy. Using real-life data and queries, we experimentally verify the effectiveness, scalability and efficiency of the methods.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114489214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Reuse for GPU Subgraph Enumeration (Extended Abstract)","authors":"Wentian Guo, Yuchen Li, K. Tan","doi":"10.1109/ICDE55515.2023.00309","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00309","url":null,"abstract":"Subgraph enumeration is important for many applications such as network motif discovery, community detection, and frequent subgraph mining. To accelerate the execution, recent works utilize graphics processing units (GPUs) to parallelize subgraph enumeration. The performance of these parallel schemes is dominated by the set intersection operations, which account for up to 95% of the total processing time. (Un)surprisingly, a significant portion (as high as 99%) of these operations is actually redundant, i.e., the same set of vertices is repeatedly encountered and evaluated. Therefore, in this paper, we seek to salvage and recycle the results of such operations to avoid repeated computation. Our solution consists of two phases. In the first phase, we generate a reusable plan that determines the opportunity for reuse. The plan is based on a novel reuse discovery mechanism that can identify available results to prevent redundant computation. In the second phase, the plan is executed to produce the subgraph enumeration results. This processing is based on a newly designed reusable parallel search strategy that can efficiently maintain and retrieve the results of set intersection operations. Our implementation on GPUs shows that our approach can achieve speedups of up to 5 times compared with the state-of-the-art GPU solutions.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keyword-based Socially Tenuous Group Queries","authors":"Huaijie Zhu, Wei Liu, Jian Yin, Ningning Cui, Jianliang Xu, Xinfeng Huang, Wang-Chien Lee","doi":"10.1109/ICDE55515.2023.00079","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00079","url":null,"abstract":"Socially tenuous groups (or simply tenuous groups) in a social network/graph refer to subgraphs with few social interactions and weak relationships among members. However, existing studies on tenuous group queries do not consider the user profiles (keywords) of the members, whereas in many social network applications, e.g., finding reviewers for paper selection and recommending seed users in social advertising, keywords also need to be considered. Thus, in this paper, we investigate the problem of keyword-based socially tenuous group (KTG) queries. A KTG query is to find the top N tenuous groups in which the members of each group jointly cover the largest number of query keywords. To address the KTG problem, we first propose two exact algorithms, namely KTG-VKC and KTG-VKC-DEG, which give priority to the valid keyword coverage and the combination of valid keyword coverage and degree, respectively, to select members to form a feasible group by adopting a branch and bound (BB) strategy. Moreover, we propose keyword pruning and k-line filtering to accelerate the algorithms. To yield diversified KTG results, we also study the problem of diversified keyword-based socially tenuous group (DKTG) queries. To deal with the DKTG problem, we propose a DKTG-Greedy algorithm by exploiting a greedy heuristic in combination with KTG-VKC-DEG. Furthermore, we design two alternative indexes, namely NL and NLRNL, to efficiently check whether the social distance of any two members is greater than the social constraint k in the above algorithms. We conduct extensive experiments using real datasets to validate our ideas and evaluate the proposed algorithms. Experimental results show that the NLRNL index achieves a better performance than the NL index.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114641855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications (Extended abstract)","authors":"Shanjian Tang, Bin He, Ce Yu, Yusen Li, Kun Li","doi":"10.1109/ICDE55515.2023.00316","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00316","url":null,"abstract":"With the explosive increase of big data in industry and academic fields, it is important to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state-of-the-art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has aimed to make it more efficient (faster) and more general under various circumstances. In this survey, we thoroughly review the various kinds of optimization techniques that improve the generality and performance of Spark. We introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Additionally, we discuss the open issues and challenges for large-scale in-memory data processing with Spark.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117189740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yue Cui, Shuhao Li, W. Deng, Zhaokun Zhang, Jing Zhao, Kai Zheng, Xiaofang Zhou
{"title":"ROI-demand Traffic Prediction: A Pre-train, Query and Fine-tune Framework","authors":"Yue Cui, Shuhao Li, W. Deng, Zhaokun Zhang, Jing Zhao, Kai Zheng, Xiaofang Zhou","doi":"10.1109/ICDE55515.2023.00107","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00107","url":null,"abstract":"Traffic prediction has drawn increasing attention due to its essential role in smart city applications. To achieve precise predictions, a large number of approaches have been proposed to model spatial dependencies and temporal dynamics. Despite their superior performance, most existing studies focus on datasets that are usually at large geographic scales, e.g., citywide, while ignoring the results on specific regions. However, in many scenarios, for example, route planning on time-dependent road networks, only small regions are of interest. We name the task of answering forecasting requests from any query region of interest (ROI) as ROI-demand traffic prediction (RTP). In this paper, we make a primary observation that existing methods fail to jointly achieve effectiveness and efficiency for RTP. To address this issue, a novel model-agnostic framework based on pre-Training, Querying and fine-Tuning, named TQT, is proposed, which first customizes input data given an ROI, and then makes fast adaptation from pre-trained traffic prediction backbone models by fine-tuning. We evaluate TQT on two real-world traffic datasets, performing both flow and speed prediction tasks. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed method.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117292395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Guo, Youfang Lin, Letian Gong, Chenyu Wang, Zeyu Zhou, Zekai Shen, Yiheng Huang, Huaiyu Wan
{"title":"Self-Supervised Spatial-Temporal Bottleneck Attentive Network for Efficient Long-term Traffic Forecasting","authors":"S. Guo, Youfang Lin, Letian Gong, Chenyu Wang, Zeyu Zhou, Zekai Shen, Yiheng Huang, Huaiyu Wan","doi":"10.1109/ICDE55515.2023.00125","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00125","url":null,"abstract":"In intelligent transportation systems, accurate long-term traffic forecasting is informative for administrators and travelers to make wise decisions in advance. Recently proposed spatial-temporal forecasting models perform well for short-term traffic forecasting, but two challenges hinder their applications for long-term forecasting in practice. Firstly, existing traffic forecasting models do not have satisfactory scalability on effectiveness and efficiency, i.e., as the prediction time spans extend, existing models either cannot capture the long-term spatial-temporal dynamics of traffic data or equip global receptive fields at the cost of quadratic computational complexity. Secondly, the dilemma between the models’ strong appetite for high-quality training data and their generalization ability is also a challenge we have to face. Thus, how to improve data utilization efficiency deserves careful thought. Aiming at solving the long-term traffic forecasting problem and facilitating the deployment of traffic forecasting models in practice, this paper proposes an efficient and effective Self-supervised Spatial-Temporal Bottleneck Attentive Network (SSTBAN). Specifically, SSTBAN follows a multi-task framework by incorporating a self-supervised learner to produce robust latent representations for historical traffic data, so as to improve its generalization performance and robustness for forecasting. Besides, we design a spatial-temporal bottleneck attention mechanism, reducing the computational complexity while encoding global spatial-temporal dynamics. Extensive experiments on real-world long-term traffic forecasting tasks, including traffic speed forecasting and traffic flow forecasting under nine scenarios, demonstrate that SSTBAN not only achieves the overall best performance but also has good computation efficiency and data utilization efficiency.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117312366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed (α, β)-Core Decomposition over Bipartite Graphs","authors":"Qing Liu, Xuankun Liao, Xinfeng Huang, Jianliang Xu, Yunjun Gao","doi":"10.1109/ICDE55515.2023.00075","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00075","url":null,"abstract":"(α, β)-core is an important cohesive subgraph model for bipartite graphs. Given a bipartite graph G, the problem of (α, β)-core decomposition is to compute non-empty (α, β)-cores for all possible values of α and β. The state-of-the-art (α, β)-core decomposition algorithm is a peeling-based algorithm, which iteratively deletes the vertex from high degree to low degree. However, as the peeling-based algorithm is designed for centralized environments, it cannot be applied to distributed environments, where graphs are partitioned and stored in different machines. Motivated by this, in this paper, we study the distributed (α, β)-core decomposition problem, aiming to develop new algorithms to support (α, β)-core decomposition in distributed environments. To this end, first, we analyze the local properties of (α, β)-core, and devise n-order Bi-indexes for the vertex, which are iteratively defined using the vertex neighbors’ (n − 1)-order Bi-indexes. Next, we propose an algorithm for (α, β)-core decomposition through iteratively calculating n-order Bi-indexes for every vertex. To further improve the efficiency of the algorithm, we propose two optimizations. Then, we extend our proposed algorithms to different distributed graph processing frameworks to make them run in distributed environments. Finally, extensive experimental results on both real and synthetic bipartite graphs demonstrate the efficiency of our proposed algorithms.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116300656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Miika Hannula, Zhuoxing Zhang, Bor-Kuan Song, S. Link
{"title":"Discovery of Cross Joins (Extended Abstract)","authors":"Miika Hannula, Zhuoxing Zhang, Bor-Kuan Song, S. Link","doi":"10.1109/ICDE55515.2023.00353","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00353","url":null,"abstract":"We present exact complexity bounds on the discovery of cross joins from database relations, and algorithms that work evidently well on real-world data sets within those bounds.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123399658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yezhou Yang, Yurong Cheng, Yeru Yang, Ye Yuan, Guoren Wang
{"title":"Batch-Based Cooperative Task Assignment in Spatial Crowdsourcing","authors":"Yezhou Yang, Yurong Cheng, Yeru Yang, Ye Yuan, Guoren Wang","doi":"10.1109/ICDE55515.2023.00095","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00095","url":null,"abstract":"The rapid development of spatial crowdsourcing platforms in the fields of express delivery, food delivery, and intelligent transportation has attracted widespread attention. As a typical problem in spatial crowdsourcing, the online task matching problem has been widely studied. Most existing research is based on task allocation with different optimization objectives under a single platform. Recently, to address the non-uniform distribution of tasks and crowd workers on a single platform, cross online task assignment has been proposed, aiming to increase mutual benefit through cooperation. However, existing methods lead to the situation where the local platform lends workers to other platforms, leaving it short of workers itself. In this paper, we propose a Batch-Based Cooperative Task Assignment (BCTA) problem, which enables multi-platform task assignment to be completed within a tolerable time. We design a BCTA model and propose a fixed-t BCTA (FT-BCTA) algorithm and an adaptive BCTA (Adt-BCTA) algorithm to solve the BCTA problem. FT-BCTA focuses on a fixed batching strategy, while Adt-BCTA adapts the batching strategy according to the supply and demand of multiple platforms. Extensive experiments on both real datasets and synthetic datasets show the effectiveness and efficiency of our algorithms.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126105282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Public Transport Planning on Roads","authors":"Libin Wang, R. C. Wong","doi":"10.1109/ICDE55515.2023.00188","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00188","url":null,"abstract":"Public transport contributes significantly to addressing some city issues such as air pollution and traffic congestion. As the public transport demand changes in urban development, we need to plan new routes to match the demand. Existing methods of planning new bus routes either are inefficient in using the path’s cost or use other inaccurate cost measurements. This paper focuses on finding a new bus route efficiently on road networks. Specifically, we first propose the Bus Routing on Roads (BRR) problem, which combines two common goals of minimizing the walking costs of passengers and maximizing the connectivity of the new route to the existing transit network. They are consistent with matching the demand and facilitating the transfer. We then show the NP-hardness of the BRR problem and design an approximation algorithm called Efficient Bus Routing on Roads (EBRR). We theoretically analyze its approximation ratio and time complexity. Extensive evaluations with state-of-the-art solutions on three real-world datasets validate the effectiveness and efficiency of EBRR. It can recommend a new bus route with high quality in around 10 seconds, 60x faster than the baselines.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124838555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}