Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang
{"title":"EffCause: Discover Dynamic Causal Relationships Efficiently from Time-Series","authors":"Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang","doi":"10.1145/3640818","DOIUrl":"https://doi.org/10.1145/3640818","url":null,"abstract":"<p>Since the proposal of Granger causality, many researchers have followed the idea and developed extensions to the original algorithm. The classic Granger causality test aims to detect the existence of the static causal relationship. Notably, a fundamental assumption underlying most previous studies is the stationarity of causality, which requires the causality between variables to keep stable. However, this study argues that it is easy to break in real-world scenarios. Fortunately, our paper presents an essential observation: if we consider a sufficiently short window when discovering the rapidly changing causalities, they will keep approximately static and thus can be detected using the static way correctly. In light of this, we develop EffCause, bringing dynamics into classic Granger causality. Specifically, to efficiently examine the causalities on different sliding window lengths, we design two optimization schemes in EffCause and demonstrate the advantage of EffCause through extensive experiments on both simulated and real-world datasets. The results validate that EffCause achieves state-of-the-art accuracy in continuous causal discovery tasks while achieving faster computation. Case studies from cloud system failure analysis and traffic flow monitoring show that EffCause effectively helps us understand real-world time-series data and solve practical problems.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"7 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139477015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma
{"title":"Multi-Task Learning with Sequential Dependence Towards Industrial Applications: A Systematic Formulation","authors":"Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma","doi":"10.1145/3640468","DOIUrl":"https://doi.org/10.1145/3640468","url":null,"abstract":"<p>Multi-task learning (MTL) is widely used in the online recommendation and financial services for multi-step conversion estimation, but current works often overlook the sequential dependence among tasks. Particularly, sequential dependence multi-task learning (SDMTL) faces challenges in dealing with complex task correlations and extracting valuable information in real-world scenarios, leading to negative transfer and a deterioration in the performance. Herein, a systematic learning paradigm of the SDMTL problem is established for the first time, which applies to more general multi-step conversion scenarios with longer conversion paths or various task dependence relationships. Meanwhile, an SDMTL architecture, named <b>T</b>ask <b>A</b>ware <b>F</b>eature <b>E</b>xtraction (<b>TAFE</b>), is designed to enable the dynamic task representation learning from a sample-wise view. TAFE selectively reconstructs the implicit shared information corresponding to each sample case and performs the explicit task-specific extraction under dependence constraints, which can avoid the negative transfer, resulting in more effective information sharing and joint representation learning. Extensive experiment results demonstrate the effectiveness and applicability of the proposed theoretical and implementation frameworks. Furthermore, the online evaluations at MYbank showed that TAFE had an average increase of 9.22(% ) and 3.76(% ) in various scenarios on the post-view click-through (& ) conversion rate (CTCVR) estimation task. Currently, TAFE has been depolyed in an online platform to provide various traffic services.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"93 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asymmetric Learning for Graph Neural Network based Link Prediction","authors":"Kai-Lang Yao, Wu-Jun Li","doi":"10.1145/3640347","DOIUrl":"https://doi.org/10.1145/3640347","url":null,"abstract":"<p>Link prediction is a fundamental problem in many graph-based applications, such as protein-protein interaction prediction. Recently, graph neural network (GNN) has been widely used for link prediction. However, existing GNN-based link prediction (GNN-LP) methods suffer from scalability problem during training for large-scale graphs, which has received little attention from researchers. In this paper, we first analyze the computation complexity of existing GNN-LP methods, revealing that one reason for the scalability problem stems from their symmetric learning strategy in applying the same class of GNN models to learn representation for both head nodes and tail nodes. We then propose a novel method, called <underline>a</underline>sym<underline>m</underline>etric <underline>l</underline>earning (AML), for GNN-LP. More specifically, AML applies a GNN model to learn head node representation while applying a multi-layer perceptron (MLP) model to learn tail node representation. To the best of our knowledge, AML is the first GNN-LP method to adopt an asymmetric learning strategy for node representation learning. Furthermore, we design a novel model architecture and apply a row-wise mini-batch sampling strategy to ensure promising model accuracy and training efficiency for AML. Experiments on three real large-scale datasets show that AML is 1.7 × ∼ 7.3 × faster in training than baselines with a symmetric learning strategy while having almost no accuracy loss.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"14 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139410440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"diGRASS: Directed Graph Spectral Sparsification via Spectrum-Preserving Symmetrization","authors":"Ying Zhang, Zhiqiang Zhao, Zhuo Feng","doi":"10.1145/3639568","DOIUrl":"https://doi.org/10.1145/3639568","url":null,"abstract":"<p>Recent <i>spectral graph sparsification</i> research aims to construct ultra-sparse subgraphs for preserving the original graph spectral (structural) properties, such as the first few Laplacian eigenvalues and eigenvectors, which has led to the development of a variety of nearly-linear time numerical and graph algorithms. However, there is very limited progress for spectral sparsification of directed graphs. In this work, we prove the existence of nearly-linear-sized spectral sparsifiers for directed graphs under certain conditions. Furthermore, we introduce a practically-efficient spectral algorithm (diGRASS) for sparsifying real-world, large-scale directed graphs leveraging spectral matrix perturbation analysis. The proposed method has been evaluated using a variety of directed graphs obtained from real-world applications, showing promising results for solving directed graph Laplacians, spectral partitioning of directed graphs, and approximately computing (personalized) PageRank vectors.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"209 1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Heterogeneous Knowledge Graph Completion with a Novel GAT-based Approach","authors":"Wanxu Wei, Yitong Song, Bin Yao","doi":"10.1145/3639472","DOIUrl":"https://doi.org/10.1145/3639472","url":null,"abstract":"<p>Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of the KGs, they are becoming inaccuracy and incomplete. This problem can be solved by the knowledge graph completion methods, of which graph attention network (GAT)-based methods stand out since their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting issues when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods demonstrate poor performance in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel <underline><b>GAT</b></underline>-based method designed for <underline><b>H</b></underline>eterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling the robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate the GATH’s performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"72 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Graph Local Extrema Convolution with Long-Tail Strategy for Misinformation Detection","authors":"Guixian Zhang, Shichao Zhang, Guan Yuan","doi":"10.1145/3639408","DOIUrl":"https://doi.org/10.1145/3639408","url":null,"abstract":"<p>It has become a cardinal task to identify fake information (misinformation) on social media because it has significantly harmed the government and the public. There are many spam bots maliciously retweeting misinformation. This study proposes an efficient model for detecting misinformation with self-supervised contrastive learning. A <b>B</b>ayesian graph <b>L</b>ocal extrema <b>C</b>onvolution (BLC) is first proposed to aggregate node features in the graph structure. The BLC approach considers unreliable relationships and uncertainties in the propagation structure, and the differences between nodes and neighboring nodes are emphasized in the attributes. Then, a new long-tail strategy for matching long-tail users with the global social network is advocated to avoid over-concentration on high-degree nodes in graph neural networks. Finally, the proposed model is experimentally evaluated with two publicly Twitter datasets and demonstrates that the proposed long-tail strategy significantly improves the effectiveness of existing graph-based methods in terms of detecting misinformation. The robustness of BLC has also been examined on three graph datasets and demonstrates that it consistently outperforms traditional algorithms when perturbed by 15% of a dataset.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"12 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie
{"title":"Finding Subgraphs with Maximum Total Density and Limited Overlap in Weighted Hypergraphs","authors":"Oana Balalau, Francesco Bonchi, T-H. Hubert Chan, Francesco Gullo, Mauro Sozio, Hao Xie","doi":"10.1145/3639410","DOIUrl":"https://doi.org/10.1145/3639410","url":null,"abstract":"<p>Finding dense subgraphs in large (hyper)graphs is a key primitive in a variety of real-world application domains, encompassing social network analytics, event detection, biology, and finance. In most such applications, one typically aims at finding several (possibly overlapping) dense subgraphs which might correspond to communities in social networks or interesting events. While a large amount of work is devoted to finding a single densest subgraph, perhaps surprisingly, the problem of finding several dense subgraphs in weighted hypergraphs with limited overlap has not been studied in a principled way, to the best of our knowledge. In this work we define and study a natural generalization of the densest subgraph problem in weighted hypergraphs, where the main goal is to find at most <i>k</i> subgraphs with maximum total aggregate density, while satisfying an upper bound on the pairwise weighted Jaccard coefficient, i.e., the ratio of weights of intersection divided by weights of union on two nodes sets of the subgraphs. After showing that such a problem is NP-Hard, we devise an efficient algorithm that comes with provable guarantees in some cases of interest, as well as, an efficient practical heuristic. Our extensive evaluation on large real-world hypergraphs confirms the efficiency and effectiveness of our algorithms.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"21 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinpeng Li, Hang Yu, zhenyuzhang, Xiangfeng Luo, Shaorong Xie
{"title":"Concept Drift Adaptation by Exploiting Drift Type","authors":"Jinpeng Li, Hang Yu, zhenyuzhang, Xiangfeng Luo, Shaorong Xie","doi":"10.1145/3638777","DOIUrl":"https://doi.org/10.1145/3638777","url":null,"abstract":"<p>Concept drift is a phenomenon where the distribution of data streams changes over time. When this happens, model predictions become less accurate. Hence, models built in the past need to be re-learned for the current data. Two design questions need to be addressed in designing a strategy to re-learn models: which type of concept drift has occurred, and how to utilize the drift type to improve re-learning performance. Existing drift detection methods are often good at determining when drift has occurred. However, few retrieve information about how the drift came to be present in the stream. Hence, determining the impact of the type of drift on adaptation is difficult. Filling this gap, we designed a framework based on a lazy strategy called Type-Driven Lazy Drift Adaptor (Type-LDA). Type-LDA first retrieves information about both how and when a drift has occurred, then it uses this information to re-learn the new model. To identify the type of drift, a drift type identifier is pre-trained on synthetic data of known drift types. Further, a drift point locator locates the optimal point of drift via a sharing loss. Hence, Type-LDA can select the optimal point, according to the drift type, to re-learn the new model. Experiments validate Type-LDA on both synthetic data and real-world data, and the results show that accurately identifying drift type can improve adaptation accuracy.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139373231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu
{"title":"ID-SR: Privacy-Preserving Social Recommendation based on Infinite Divisibility for Trustworthy AI","authors":"Jingyi Cui, Guangquan Xu, Jian Liu, Shicheng Feng, Jianli Wang, Hao Peng, Shihui Fu, Zhaohua Zheng, Xi Zheng, Shaoying Liu","doi":"10.1145/3639412","DOIUrl":"https://doi.org/10.1145/3639412","url":null,"abstract":"<p>Recommendation systems powered by AI are widely used to improve user experience. However, it inevitably raises privacy leakage and other security issues due to the utilization of extensive user data. Addressing these challenges can protect users’ personal information, benefit service providers, and foster service ecosystems. Presently, numerous techniques based on differential privacy have been proposed to solve this problem. However, existing solutions encounter issues such as inadequate data utilization and an tenuous trade-off between privacy protection and recommendation effectiveness. To enhance recommendation accuracy and protect users’ private data, we propose ID-SR, a novel privacy-preserving social recommendation scheme for trustworthy AI based on the infinite divisibility of Laplace distribution. We first introduce a novel recommendation method adopted in ID-SR, which is established based on matrix factorization with a newly designed social regularization term for improving recommendation effectiveness. Additionally, we propose a differential privacy preserving scheme tailored to the above method that leverages the Laplace distribution’s characteristics to safeguard user data. Theoretical analysis and experimentation evaluation on two publicly available datasets demonstrate that our scheme achieves a superior balance between privacy protection and recommendation effectiveness, ultimately delivering an enhanced user experience.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"26 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139094844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng
{"title":"Utility-aware Privacy Perturbation for Training Data","authors":"Xinjiao Li, Guowei Wu, Lin Yao, Zhaolong Zheng, Shisong Geng","doi":"10.1145/3639411","DOIUrl":"https://doi.org/10.1145/3639411","url":null,"abstract":"<p>Data perturbation under differential privacy constraint is an important approach of protecting data privacy. However, as the data dimensions increase, the privacy budget allocated to each dimension decreases and thus the amount of noise added increases, which eventually leads to lower data utility in training tasks. To protect the privacy of training data while enhancing data utility, we propose an Utility-aware training data Privacy Perturbation scheme based on attribute Partition and budget Allocation (UPPPA). UPPPA includes three procedures, the quantification of attribute privacy and attribute importance, attribute partition, and budget allocation. The quantification of attribute privacy and attribute importance based on information entropy and attribute correlation provide an arithmetic basis for attribute partition and budget allocation. During the attribute partition, all attributes of training data are classified into high and low classes to achieve privacy amplification and utility enhancement. During the budget allocation, a <i>γ</i>-privacy model is proposed to balance data privacy and data utility so as to provide privacy constraint and guide budget allocation. Three comprehensive sets of real-world data are applied to evaluate the performance of UPPPA. Experiments and privacy analysis show that our scheme can achieve the tradeoff between privacy and utility.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"2 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139096533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}