{"title":"Distribution-Regularized Federated Learning on Non-IID Data","authors":"Yansheng Wang, Yongxin Tong, Zimu Zhou, Ruisheng Zhang, Sinno Jialin Pan, Lixin Fan, Qiang Yang","doi":"10.1109/ICDE55515.2023.00164","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00164","url":null,"abstract":"Federated learning (FL) has emerged as a popular machine learning paradigm recently. Compared with traditional distributed learning, its unique challenges mainly lie in communication efficiency and non-IID (heterogeneous data) problem. While the widely adopted framework FedAvg can reduce communication overhead significantly, its effectiveness on non-IID data still lacks exploration. In this paper, we study the non-IID problem of FL from the perspective of domain adaptation. We propose a distribution regularization for FL on non-IID data such that the discrepancy of data distributions between clients is reduced. To further reduce the communication cost, we devise two novel distributed learning algorithms, namely rFedAvg and rFedAvg+, for efficiently learning with the distribution regularization. More importantly, we theoretically establish their convergence for strongly convex objectives. Extensive experiments on 4 datasets with both CNN and LSTM as learning models verify the effectiveness and efficiency of the proposed algorithms.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129419096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User-Defined Functions in Modern Data Engines","authors":"Ioannis Foufoulas, A. Simitsis","doi":"10.1109/ICDE55515.2023.00276","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00276","url":null,"abstract":"Modern data management applications involve complex processing tasks over large volumes of data. Although this falls naturally within the scope of relational databases, many such tasks cannot be expressed in SQL and require additional expressive power achieved via user-defined functions (UDFs). However, efficient processing of UDFs in data engines hinges on dealing with the impedance mismatch between UDF execution and SQL processing. In recent years, the problem of efficient UDF execution in modern data engines has gained significant traction. In this tutorial, we present recent advancements in this area, involving a broad scope of solutions ranging from algebraic, cost-based optimization to low-level, physical query optimization, compilation, and execution. We also describe limitations and open issues, and discuss promising future research directions.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127532632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Editing Rules by Deep Reinforcement Learning","authors":"Yinan Mei, Shaoxu Song, Chenguang Fang, Ziheng Wei, Jingyun Fang, Jiang Long","doi":"10.1109/ICDE55515.2023.00034","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00034","url":null,"abstract":"Editing rules specify the conditions of applying high quality master data to repair low quality input data. Discovering editing rules, however, is challenging, since it considers not only the well curated master data but also the large-scale input data, an extremely large search space. A natural baseline, namely EnuMiner, costly enumerates the rules with possible conditions from both master and input data. Although several pruning strategies are enabled, the algorithm still takes a long time when the enumeration space is large. To avoid enumerating all candidate rules during mining, we argue to model the rule discovery process as a Markov Decision Process. Specifically, we discover editing rules by growing a rule tree where each node corresponds to a rule. The algorithm generates a new rule from the current node as a child node. We propose a reinforcement learning-based editing rule discovery algorithm, RLMiner, which trains an agent to wisely make decisions on branches when traversing the tree. Following the idea of evaluating rules, we design a reward function that is more in line with rule discovery scenarios and makes our algorithm perform effectively and efficiently. The experimental results show that our proposed RLMiner can mine high-utility editing rules like EnuMiner and scale well on the datasets with many attributes and large domains.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130680962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring both Individuality and Cooperation for Air-Ground Spatial Crowdsourcing by Multi-Agent Deep Reinforcement Learning","authors":"Yuxiao Ye, Chi Harold Liu, Zipeng Dai, Jianxin R. Zhao, Ye Yuan, Guoren Wang, Jian Tang","doi":"10.1109/ICDE55515.2023.00023","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00023","url":null,"abstract":"Spatial crowdsourcing (SC) has proven as a promising paradigm to employ human workers to collect data from diverse Point-of-Interests (PoIs) in a given area. Different from using human participants, we propose a novel air-ground SC scenario to fully take advantage of benefits brought by unmanned vehicles (UVs), including unmanned aerial vehicles (UAVs) with controllable high mobility and unmanned ground vehicles (UGVs) with abundant sensing resources. The objective is to maximize the amount of collected data, geographical fairness among all PoIs, and minimize the data loss and energy consumption, integrated as one single metric called \"efficiency\". We explicitly explore both individuality and cooperation natures of UAVs and UGVs by proposing a multi-agent deep reinforcement learning (MADRL) framework called \"h/i-MADRL\". Compatible with all multi-agent actor-critic methods, h/i-MADRL adds two novel plug-in modules: (a) h-CoPO, which models the cooperation preference among heterogeneous UAVs and UGVs; and (b) i-EOI, which extracts the UV’s individuality and encourages a better spatial division of work by adding intrinsic reward. Extensive experimental results on two real-world datasets on Purdue and NCSU campuses confirm that h/i-MADRL achieves a better exploration of both individuality and cooperation simultaneously, resulting in a better performance in terms of efficiency compared with five baselines.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128836271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning (Extended Abstract)","authors":"Yifan Zhang, P. Zhao, Qingyao Wu, Bin Li, Junzhou Huang, Mingkui Tan","doi":"10.1109/ICDE55515.2023.00312","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00312","url":null,"abstract":"Portfolio Selection is an important real-world financial task and has attracted extensive attention in artificial intelligence communities. This task, however, has two main difficulties: (i) the non-stationary price series and complex asset correlations make the learning of feature representation very hard; (ii) the practicality principle in financial markets requires controlling both transaction and risk costs. Most existing methods adopt handcraft features and/or consider no constraints for the costs, which may make them perform unsatisfactorily and fail to control both costs in practice. In this paper, we propose a cost-sensitive portfolio selection method with deep reinforcement learning. Specifically, a novel two-stream portfolio policy network is devised to extract both price series patterns and asset correlations, while a new cost-sensitive reward function is developed to maximize the accumulated return and constrain both costs via reinforcement learning. We theoretically analyze the near-optimality of the proposed reward, which shows that the growth rate of the policy regarding this reward function can approach the theoretical optimum. We also empirically evaluate the proposed method on real-world datasets. Promising results demonstrate the effectiveness and superiority of the proposed method in terms of profitability, cost-sensitivity and representation abilities.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125523814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADAMANT: A Query Executor with Plug-In Interfaces for Easy Co-processor Integration","authors":"B. Gurumurthy, David Broneske, Gabriel Campero Durand, Thilo Pionteck, Gunter Saake","doi":"10.1109/ICDE55515.2023.00093","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00093","url":null,"abstract":"Today’s processor landscape is increasingly heterogeneous with the availability of co-processors. This landscape impacts query engines, as they need to be reworked to keep competitive performance by leveraging the underlying architectures. Such a rework might be costly if, for each external processor or SDK, peripheral components needed to be developed as well; resulting in redundant effort and adoption difficulties. In this paper, we propose an approach to overcome these shortcomings through ADAMANT – a query executor equipped with interfaces to plug-in new co-processors without reworking other components of a query engine. ADAMANT consists of 1) pluggable interfaces that allow interaction with co-processors, encapsulating operator implementations, and 2) a unified runtime that handles the execution on arbitrary co-processors, with a chunked execution model for scalable query processing. To evaluate ADAMANT’s versatility, we plug different implementations of a CPU/GPU-based system (using OpenCL, OpenMP, & CUDA) and analyze their performance on TPC-H queries. We identify a 4x performance difference between an arbitrary chunked execution vs. a more architecturally conscious pipelined execution. Furthermore, our comparisons with HeavyDB show complex performance variations from speed-ups up to a factor of 2x from our hardware-conscious execution. We envision initiatives like ADAMANT to ease the study of complex optimizations required in co-processor systems, paving the way for efficient and portable data management tools without cutbacks.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126628528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forecasting COVID-19 Dynamics: Clustering, Generalized Spatiotemporal Attention, and Impacts of Mobility and Geographic Proximity","authors":"Tong Shen, Yang Li, J. Moura","doi":"10.1109/ICDE55515.2023.00221","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00221","url":null,"abstract":"Forecasting the dynamics of COVID-19 enables government agencies and public health administrators to take proactive measures to combat the pandemic. This forecasting task faces several key challenges: First, the dynamics of COVID-19 exhibit complex spatial and temporal dependencies. The current growing trend at a location may be similar to that at another location in the past. Second, numerous factors, such as population mobility and geographic proximity between regions, mask usage, vaccine coverage, etc., significantly impact the dynamics. Third, we need to find the appropriate granularity for the forecasting task. The granularity should not be too coarse that we ignore the idiosyncrasies of individual regions. Still, the granularity should not be too fine that the prediction results are seriously vulnerable to noise. This paper addresses these challenges. We propose a simple but effective clustering algorithm that finds the appropriate granularity for the forecasting task. We invent generalized spatiotemporal attention, an attention mechanism that is generalized enough to capture the complex spatial and temporal dependencies and to flexibly account for intra- and inter-region characteristics such as geographic proximity and population mobility. Based on this generalized spatiotemporal attention, we designed COVID-Forecaster, a lightweight deep learning model for forecasting the dynamics of COVID-19. Experimental results demonstrate that COVID-Forecaster significantly outperforms state-of-the-art models. For example, COVID-Forecaster reduces the mean absolute percentage error (MAPE) by 6.8% and the weighted absolute percentage error (WAPE) by 13.5% in forecasting the COVID-19 dynamics at the 3141 counties of the United States.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123171160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DBCatcher: A Cloud Database Online Anomaly Detection System based on Indicator Correlation","authors":"Guangyu Zhang, Chun-hua Li, Ke Zhou, Li Liu, Ce Zhang, Wancheng Chen, Haotian Fang, Bin Cheng, Jie Yang, Jiashu Xing","doi":"10.1109/ICDE55515.2023.00091","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00091","url":null,"abstract":"Anomaly detection systems play an important role in maintaining the stability of cloud databases. Existing studies mainly focus on significant deviations in multivariate time series, such as a combination of CPU utilization, transactions per second, etc., to detect abnormal issues. Due to the complexity of cloud database structure and functions, these approaches struggle to achieve a balance among detection performance, detection efficiency and workload adaptability. In this paper, we propose DBCatcher, a cloud database online anomaly detection system based on indicator correlation. Through extensive analysis of real-world cloud database time series, we find the correlations among trends in the same key performance indicators across databases within the same unit, which inspires us to explore a time series correlation measurement method that can efficiently detect abnormal issues. Meanwhile, we design a flexible time window observation mechanism and an adaptive threshold learning policy to minimize misjudgment caused by key performance indicator fluctuations, greatly enhancing the detection performance and workload adaptability. We conduct extensive experiments under real-world and synthetic workloads. Experimental results show that DBCatcher significantly improves the detection performance and detection efficiency compared to existing methods.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123271328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RETIA: Relation-Entity Twin-Interact Aggregation for Temporal Knowledge Graph Extrapolation","authors":"Kangzheng Liu, Feng Zhao, Guandong Xu, Xianzhi Wang, Hai Jin","doi":"10.1109/ICDE55515.2023.00138","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00138","url":null,"abstract":"Temporal knowledge graph (TKG) extrapolation aims to predict future unknown events (facts) based on historical information, and has attracted considerable attention due to its great practical significance. Accurate representations (embeddings) of entities and relations form the basis of TKG extrapolation. Recent work has been devoted to improving the rationality of entity representations. However, on the one hand, ignoring relation modeling results in incomplete relation representations; therefore, some approaches aggregate only immediately adjacent entities of relations, but this can lead to the \"message islands\" problem of relation modeling. On the other hand, ignoring the association constraints between relations and entities can make the embeddings of both relations and entities prone to overfitting. To address the abovementioned challenges, we propose an advanced method, namely, RETIA. For the former issue, we generate twin hyperrelation subgraphs for each historical subgraph and then aggregate both the adjacent entities and relations in the hyperrelation subgraphs through a graph convolutional network (GCN). About the latter concern, we propose a twin-interact module (TIM), which provides communication channels for relation aggregation and entity aggregation during the evolution of the historical sequence. Experiments conducted on five public datasets show that RETIA has made great improvements across several evaluation metrics. Our released code is available at https://github.com/CGCL-codes/RETIA.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123120828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Biological Knowledge Graph Completion via Triple Co-Attention Mechanism","authors":"Derong Xu, Jingbo Zhou, Tong Xu, Yuan Xia, Ji Liu, Enhong Chen, D. Dou","doi":"10.1109/ICDE55515.2023.10231041","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.10231041","url":null,"abstract":"Biological Knowledge Graphs (BKGs) can help to model complex biological systems in a structural way to support various tasks. Nevertheless, the incompleteness problem may limit the performance of existing BKGs, which still deserves new methods to reveal the missing relations. Though great efforts have been made to knowledge graph completion, existing methods are not easily adapted to multimodal biological information such as molecular structures and textual descriptions. To this end, we propose a novel co-attention-based multimodal embedding framework, named CamE, for the multimodal BKG completion task. Specifically, we design a Triple Co-Attention (TCA) operator to capture and highlight the same semantic features among different modalities. Based on TCA, we further propose two components to handle multimodal fusion and multimodal entity-relation interaction, respectively. One is the multimodal TCA fusion module to achieve a multimodal joint representation for each entity in the BKG. It aims to project different modal information into a common space by capturing the same semantic features and overcoming the modality gap. The other is the relation-aware interactive TCA module to learn interactive representation by modelling the deep interaction between multimodal entities and relations. Extensive experiments on two real-world multimodal BKG datasets demonstrate that our method significantly outperforms several state-of-the-art baselines, including 10.3% and 16.2% improvement w.r.t. MRR and Hits@1 metrics over its best competitors on public DRKG-MM dataset.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122292652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}