{"title":"A Novel Discriminative Dictionary Pair Learning Constrained by Ordinal Locality for Mixed Frequency Data Classification : Extended abstract","authors":"Hong Yu, Qianying Yang, Guoyin Wang, Yongfang Xie","doi":"10.1109/ICDE55515.2023.00349","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00349","url":null,"abstract":"A dilemma faced by classification is that, in some applications, the data is not collected at the same frequency. We investigate mixed frequency data in a new way and treat it as a special kind of multi-view data, in which each view is collected at a different sampling frequency. This paper proposes a discriminative dictionary pair learning method constrained by ordinal locality for mixed frequency data classification (abbreviated as DPLOL-MF). This method integrates a synthesis dictionary and an analysis dictionary into a dictionary pair, which not only reduces the computational cost caused by the ℓ0 or ℓ1-norm constraint, but also can deal with the sampling frequency inconsistency. DPLOL-MF utilizes a synthesis dictionary to learn class-specific reconstruction information and employs an analysis dictionary to generate coding coefficients by analyzing samples. In particular, an ordinal locality preserving term is leveraged to constrain the atoms of the dictionary pair so that the learned dictionary pair becomes more discriminative. Besides, we design a specific classification scheme for the inconsistent sample size of mixed frequency data. 
This paper illustrates a novel idea to solve the classification task of mixed frequency data and the experimental results demonstrate the effectiveness of the proposed method.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134015516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GALE: Active Adversarial Learning for Erroneous Node Detection in Graphs","authors":"Sheng Guan, Hanchao Ma, Mengying Wang, Yinghui Wu","doi":"10.1109/ICDE55515.2023.00134","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00134","url":null,"abstract":"We introduce GALE, an active adversarial learning framework to detect nodes with erroneous information in attributed graphs. GALE is empowered by a new adversarial active error detection framework, which integrates active learning with a graph generative adversarial model to best exploit limited labeled examples of erroneous nodes. It dynamically determines diversified query nodes in batches with bounded size in terms of node typicality to enrich a pool of examples, which in turn provides representative examples to best train an adversarial classifier to capture different types of errors. Moreover, GALE provides an annotation algorithm to suggest a context of possible correct attribute values and error types, to facilitate the labeling of query nodes. We show that using limited queries and examples, GALE significantly improves over competing methods such as constraint-based detection, outlier detection, and Graph Neural Networks (e.g., GCNs), with 32%, 31%, and 17% gain in F-1 score on average, and is feasible in learning cost for large graphs.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117306039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload-Aware Log-Structured Merge Key-Value Store for NVM-SSD Hybrid Storage","authors":"Lixiang Chen, Ruihao Chen, Chengcheng Yang, Yuxing Han, Rong Zhang, Xuan Zhou, Peiquan Jin, Weining Qian","doi":"10.1109/ICDE55515.2023.00171","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00171","url":null,"abstract":"The log-structured merge tree (LSM-tree) has been widely adopted as a backbone of modern key-value stores. However, the multiple exponentially growing levels of the LSM-tree make it suffer from high write amplification. Existing studies often improve the write performance by sacrificing the read performance, which is an inefficient way to trade off update and search efficiency. In this paper, we exploit nonvolatile memory (NVM) to address the write amplification issue for systems with NVM-SSD hybrid storage, and further propose a reinforcement learning method to navigate between update and search efficiency under varying workloads. Specifically, we first propose a lightweight hot data identification method to efficiently capture access recency as well as frequency in NVM with relatively large capacity. On this basis, we can eliminate different versions of frequently updated data in high-performance NVM without pushing them to SSD. To improve the data access locality and facilitate fine-grained index tuning in each level, we devise a virtual-split method to partition the key space gradually without extra write amplification. Finally, we propose a cost-based Q-learning algorithm to adaptively tune the data organization of each partition according to the changing access patterns. 
Experimental results show that our approach outperforms existing methods by up to 2.67×.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133024604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DGDFS: Dependence Guided Discriminative Feature Selection for Predicting Adverse Drug-Drug Interaction : Extended Abstract","authors":"Jiajing Zhu, Yongguo Liu, Chuanbiao Wen, Xindong Wu","doi":"10.1109/ICDE55515.2023.00347","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00347","url":null,"abstract":"Adverse drug-drug interaction (ADDI) is a significant life-threatening issue for public health. The current methods for ADDI prediction usually work in a \"nondiscriminatory\" manner by treating each feature without discrimination and employing all features equally in ADDI modeling. Driven by this issue, we propose a Dependence Guided Discriminative Feature Selection (DGDFS) model for ADDI prediction, in which molecular structure and side effect are adopted with the incorporation of l2,0-norm equality constraints to select discriminative molecular substructures and side effects, and three dependence-based terms among molecular structure, side effect, and ADDIs to guide feature selection. Extensive experiments demonstrate the superior performance of DGDFS compared with fourteen state-of-the-art ADDI prediction and feature selection models.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133592943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Sided Instant Incentive Optimization under a Shared Budget in Ride-Hailing Services","authors":"Junlin Chen, Xin Liu, Weidong Liu, Hai Jiang","doi":"10.1109/ICDE55515.2023.00267","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00267","url":null,"abstract":"Ride-hailing has become a popular service in recent years. For each ride-hailing request, after the platform determines the fare for the passenger and the commission for the driver, it is not uncommon for the platform to set aside a promotional budget and give instant incentives to both sides, that is, a discount to the passenger and a bonus to the driver, to further improve the match between the two sides. Although there is a proliferation of studies on the determination of the fare and the commission in ride-hailing services, they cannot address the instant incentive problem because their approaches do not deal with budget constraints. In this research, we investigate this two-sided instant incentive problem under a shared promotional budget, which is new to the literature. We formulate this problem as a binary integer linear programming problem, whose goal is to find the optimal incentives for both sides given predicted trip completion probabilities. We first assume that the predicted trip completion probabilities are accurate and develop a Lagrangian-dual-based approach to decompose the problem into a series of subproblems that can be efficiently solved. We then proceed to accommodate the inaccuracy in the predictions and develop a robust instant incentive optimization approach that exploits the prediction error reflected by historical data. We conduct numerical experiments using real data in the city of Nanjing from a leading ride-hailing platform in China. 
Results show that compared to the baseline approach: (i) Before we account for prediction inaccuracy, our solution approach can improve the number of completed requests by at most 8.30% with a decision error of 8.31%; and (ii) After we account for prediction inaccuracy, our solution approach can improve the number of completed requests by at most 9.81% while reducing the decision error to 7.03%.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127785614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPiX: Real-Time Analytics Over Out-of-Order Data Streams by Incremental Sliding-Window Aggregation","authors":"Savong Bou, H. Kitagawa, T. Amagasa","doi":"10.1109/ICDE55515.2023.00310","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00310","url":null,"abstract":"Stream processing is used in various fields. In the field of big data, stream aggregation is a popular processing technique, but it suffers serious setbacks when the order in which events (e.g., stream elements) occur differs from the order in which they arrive at the system. Such data streams are called \"non-FIFO streams\". This phenomenon usually occurs in a distributed environment due to many factors, such as network disruptions, delays, etc. Many analysis scenarios require efficient processing of such non-FIFO streams to meet various data processing requirements. This paper proposes an efficient scalable checkpoint-based bidirectional indexing approach, called CPiX, for faster real-time analysis over non-FIFO streams. CPiX maintains the partial aggregation results in an on-demand manner. CPiX needs less time and space than the state-of-the-art approach. Extensive experiments confirm that CPiX can deal with out-of-order streams very efficiently and is, on average, about 3.8 times faster than the state-of-the-art approach while consuming less memory. CPiX and the existing approaches support the distributive and algebraic aggregation functions, such as min, average, standard deviation, etc. 
Holistic aggregation is beyond the scope.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133744212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Effective Multi-Modal Queries through Heterogeneous Network Embedding (Extended Abstract)","authors":"T. Nguyen, Chi Thang Duong, Hongzhi Yin, M. Weidlich, S. T. Mai, K. Aberer, Quoc Viet Hung Nguyen","doi":"10.1109/ICDE55515.2023.00322","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00322","url":null,"abstract":"Recent information retrieval (IR) systems answer a multi-modal query by considering it as a set of separate uni-modal queries. However, depending on the chosen operationalisation, such an approach is inefficient or ineffective. It either requires multiple passes over the data or leads to inaccuracies since the relations between data modalities are neglected in the relevance assessment. To mitigate these challenges, we present an IR system that has been designed to answer genuine multi-modal queries. It relies on a heterogeneous network embedding, so that features from diverse modalities can be incorporated when representing both a query and the data over which it shall be evaluated. An experimental evaluation using diverse real-world and synthetic datasets illustrates that our approach returns twice the amount of relevant information compared to baseline techniques, while scaling to large multi-modal databases.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133822002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pusion - A Generic and Automated Framework for Decision Fusion","authors":"Yannick Wilhelm, Peter Reimann, W. Gauchel, Steffen Klein, B. Mitschang","doi":"10.1109/ICDE55515.2023.00252","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00252","url":null,"abstract":"Combining two or more classifiers into an ensemble and fusing the individual classifier decisions to a consensus decision can improve the accuracy for a classification problem. The classification improvement of the fusion result depends on numerous factors, such as the data set, the combination scenario, the decision fusion algorithm, as well as the prediction accuracies and diversity of the multiple classifiers to be combined. Due to these factors, the best decision fusion algorithm for a given decision fusion problem cannot be generally determined in advance. In order to support the user in combining classifiers and to achieve the best possible fusion result, we propose the PUSION (Python Universal fuSION) framework, a novel generic and automated framework for decision fusion of classifiers. The framework includes 14 decision fusion algorithms and covers a total of eight different combination scenarios for both multi-class and multi-label classification problems. The introduced concept of AutoFusion detects the combination scenario for a given use case, automatically selects the applicable decision fusion algorithms and returns the decision fusion algorithm that leads to the best fusion result. The framework is evaluated with two real-world case studies in the field of fault diagnosis. In both case studies, the consensus decision of multiple classifiers and heterogeneous fault diagnosis methods significantly increased the overall classification accuracy. 
Our evaluation results show that our framework is of practical relevance and reliably finds the best performing decision fusion algorithm for a given combination task.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115454594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inconsistency Detection with Temporal Graph Functional Dependencies","authors":"Morteza Alipourlangouri, Adam Mansfield, Fei Chiang, Yinghui Wu","doi":"10.1109/ICDE55515.2023.00042","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00042","url":null,"abstract":"Data dependencies have been extended to graphs to characterize topological and value constraints. Existing data dependencies are defined to capture inconsistencies in static graphs. Nevertheless, inconsistencies may occur over evolving graphs and only for certain time periods. The need for capturing such inconsistencies in temporal graphs is evident in anomaly detection and predictive dynamic network analysis. This paper introduces a class of data dependencies called Temporal Graph Functional Dependencies (TGFDs). TGFDs generalize functional dependencies to temporal graphs as a sequence of graph snapshots that are induced by time intervals, and enforce both topological constraints and attribute value dependencies that must be satisfied by these snapshots. (1) We establish the complexity results for the satisfiability and implication problems of TGFDs. (2) We propose a sound and complete axiomatization system for TGFDs. (3) We also present efficient parallel algorithms to detect inconsistencies in temporal graphs as violations of TGFDs. The algorithm exploits data and temporal locality induced by time intervals, and uses incremental pattern matching and load balancing strategies to enable feasible error detection in large temporal graphs. 
Using real datasets, we experimentally verify that our algorithms achieve lower runtimes compared to existing baselines, while improving the accuracy over error detection using existing graph data constraints, e.g., GFDs and GTARs with 55% and 74% gain in F1-score, respectively.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115817124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reactive Company Control in Company Knowledge Graphs","authors":"Davide Magnanimi, Luigi Bellomarini, S. Ceri, D. Martinenghi","doi":"10.1109/ICDE55515.2023.00256","DOIUrl":"https://doi.org/10.1109/ICDE55515.2023.00256","url":null,"abstract":"The Company Control Problem consists in understanding who exerts decision power in companies. Central banks, financial intelligence units, and market regulators are all interested in this problem, which is crucial for their core goals. In the context where these actors operate, changes in company control call for immediate reactions. Yet, computing control relationships is a computationally expensive problem that involves traversing the entire shareholding structure and aggregating shares over multiple paths. In the context of the joint European banking supervision, the Bank of Italy will soon handle the shareholding graph of all European companies, which comprises hundreds of millions of entities (firms and individuals) and billions of edges and properties. This graph is highly volatile as the Bank continuously receives updates about shareholding relationships with unpredictable high frequency. This makes the straightforward bulk solution, where all the company control relationships are computed and materialized whenever a change occurs, unaffordable in practice. In this work, we present an incremental rule-based formalization of the problem, adopting the Vadalog fragment of the Datalog+/- families of languages. Our approach analyzes the specific change, singles out the portions of the graph that are affected by it, and selectively updates them. This allows one both to timely evaluate the impact of ownership variations on an extensive European-scale shareholding graph and to enable economists to perform the so-called \"what-if analysis\", i.e., simulation scenarios to proactively study the consequences of potential share acquisition operations, that currently are prohibitively time expensive. 
We provide an extensive experimental evaluation on very large company graphs, comparatively confirming the scalability of our technique in a real production setting.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115843642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}