{"title":"Condensing Pre-Augmented Recommendation Data via Lightweight Policy Gradient Estimation","authors":"Jiahao Wu;Wenqi Fan;Jingfan Chen;Shengcai Liu;Qijiong Liu;Rui He;Qing Li;Ke Tang","doi":"10.1109/TKDE.2024.3484249","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484249","url":null,"abstract":"Training recommendation models on large datasets requires significant time and resources. It is desired to construct concise yet informative datasets for efficient training. Recent advances in dataset condensation show promise in addressing this problem by synthesizing small datasets. However, applying existing methods of dataset condensation to recommendation has limitations: (1) they fail to generate discrete user-item interactions, and (2) they cannot preserve users’ potential preferences. To address the limitations, we propose a lightweight condensation framework tailored for recommendation (DConRec), focusing on condensing user-item historical interaction sets. Specifically, we model the discrete user-item interactions via a probabilistic approach and design a pre-augmentation module to incorporate the potential preferences of users into the condensed datasets. While the substantial size of datasets leads to costly optimization, we propose a lightweight policy gradient estimation to accelerate the data synthesis. Experimental results on multiple real-world datasets have demonstrated the effectiveness and efficiency of our framework. 
Besides, we provide a theoretical analysis of the provable convergence of DConRec.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"162-173"},"PeriodicalIF":8.9,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Cost Modeling for Reducing Query Performance Regression in Index Tuning","authors":"Wentao Wu","doi":"10.1109/TKDE.2024.3484954","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484954","url":null,"abstract":"Autonomous index tuning (“auto-indexing” for short) has recently started being supported by cloud database service providers. Index tuners rely on the query optimizer's cost estimates to recommend indexes that can minimize the execution cost of an input workload. Such cost estimates are often erroneous, which can lead to significant query performance regression. To reduce the chance of regression, existing work primarily uses machine learning (ML) technologies to build prediction models to improve query execution cost estimation using actual query execution telemetry as training data. However, training data collection is typically an expensive process, especially for index tuning due to the significant overhead of creating/dropping indexes. As a result, the amount of training data can be limited in auto-indexing for cloud databases. In this paper, we propose a new approach named “hybrid cost modeling” to address this challenge. The key idea is to limit the ML-based modeling effort to the leaf operators such as table scans, index scans, and index seeks, and then combine the ML-model predicted costs of the leaf operators with the optimizer's estimated costs of the other operators in the query plan. 
We conduct theoretical study as well as empirical evaluation to demonstrate the efficacy of applying hybrid cost modeling to index tuning, using both industrial benchmarks and real workloads.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"379-391"},"PeriodicalIF":8.9,"publicationDate":"2024-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis","authors":"Zezhi Shao;Fei Wang;Yongjun Xu;Wei Wei;Chengqing Yu;Zhao Zhang;Di Yao;Tao Sun;Guangyin Jin;Xin Cao;Gao Cong;Christian S. Jensen;Xueqi Cheng","doi":"10.1109/TKDE.2024.3484454","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484454","url":null,"abstract":"Multivariate Time Series (MTS) analysis is crucial to understanding and managing complex systems, such as traffic and energy systems, and a variety of approaches to MTS forecasting have been proposed recently. However, we often observe inconsistent or seemingly contradictory performance findings across different studies. This hinders our understanding of the merits of different approaches and slows down progress. We address the need for means of assessing MTS forecasting proposals reliably and fairly, in turn enabling better exploitation of MTS as seen in different applications. Specifically, we first propose BasicTS+, a benchmark designed to enable fair, comprehensive, and reproducible comparison of MTS forecasting solutions. BasicTS+ establishes a unified training pipeline and reasonable settings, enabling an unbiased evaluation. Second, we identify the heterogeneity across different MTS as an important consideration and enable classification of MTS based on their temporal and spatial characteristics. Disregarding this heterogeneity is a prime reason for difficulties in selecting the most promising technical directions. Third, we apply BasicTS+ along with rich datasets to assess the capabilities of more than 30 MTS forecasting solutions. 
This provides readers with an overall picture of the cutting-edge research on MTS forecasting.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"291-305"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFNO: Detecting Fuzzy Neighborhood Outliers","authors":"Zhong Yuan;Peng Hu;Hongmei Chen;Yingke Chen;Qilin Li","doi":"10.1109/TKDE.2024.3484448","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484448","url":null,"abstract":"Outlier Detection (OD) has attracted extensive research due to its application in many fields. The idea of neighborhood computing is one of the widely used methods in outlier analysis. Nevertheless, these methods mainly use certainty strategies to model outlier detection, so they cannot effectively handle the fuzzy information in the dataset. Moreover, they mainly focus on dealing with outlier detection in numerical data and cannot effectively find outliers in mixed-attribute data. Fuzzy information granulation theory is an effective granular computing model that allows objects to belong to a set to a certain extent (i.e., membership degree), which makes it possible to better handle uncertainty problems such as fuzziness. In this work, we propose an outlier detection model based on fuzzy neighborhoods. First, a hybrid fuzzy similarity is constructed to granulate the set of objects to form fuzzy information granules. Second, the fuzzy $k$-nearest neighbor is defined to describe the fuzzy local information. Then, the fuzzy neighborhood density is defined to indicate the degree of aggregation of each object. The smaller the fuzzy neighborhood density of an object, the more likely it is to be an outlier. Based on this idea, the fuzzy neighborhood deviation degree is defined to quantify the degree of outliers of objects. Finally, the fuzzy deviation degree on the set of conditional attributes is constructed to indicate the outlier scores of objects. 
Experimental comparisons with state-of-the-art methods show that the proposed method has a significant improvement on the AUC index and applies to three types of data.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"200-209"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CausalFormer: An Interpretable Transformer for Temporal Causal Discovery","authors":"Lingbai Kong;Wengen Li;Hanchen Yang;Yichao Zhang;Jihong Guan;Shuigeng Zhou","doi":"10.1109/TKDE.2024.3484461","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484461","url":null,"abstract":"Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we proposed an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution which aggregates each input time series along the temporal dimension under the temporal priority constraint. Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. 
Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"102-115"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Derivative Topic Dissemination Model Based on Representation Learning and Topic Relevance","authors":"Qian Li;Yunpeng Xiao;Xinming Zhou;Rong Wang;Sirui Duan;Xiang Yu","doi":"10.1109/TKDE.2024.3484496","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484496","url":null,"abstract":"In social networks, topics often demonstrate a “fission” trend, where new topics arise from existing ones. Effectively predicting collective behavioral patterns during the dissemination of derivative topics is crucial for public opinion management. Addressing the symbiotic and antagonistic nature of “native-derived” topics, a derivative topic propagation model based on representation learning and topic relevance is proposed herein. First, considering the transition in user interest levels and cognitive accumulation at different evolutionary stages of native-derivative topics, a user content representation method, namely DTR2vec, is introduced, based on topic-related feature associations, for learning user content features. Then, evolutionary game theory is introduced by recognizing the symbiotic and antagonistic nature of “native-derived” topics during their propagation. Moreover, implicit relationships between users are explored, and user influence is quantified for learning user structural features. Finally, considering the graph convolutional network’s ability to process non-Euclidean structured data, the proposed model integrates user content and structural features to predict user forwarding behavior. 
Experimental results indicate that the proposed model not only effectively predicts the dissemination trends of derivative topics but also more authentically reflects the association and game relationships between native and derivative topics during their dissemination.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"7468-7482"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Contrastive Multi-View Subspace Clustering With Representation and Cluster Interactive Learning","authors":"Xuejiao Yu;Yi Jiang;Guoqing Chao;Dianhui Chu","doi":"10.1109/TKDE.2024.3484161","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3484161","url":null,"abstract":"Multi-view clustering is an important approach to mining the valuable information within multi-view data. In this paper, we propose a novel multi-view deep subspace clustering method based on contrastive learning and Cauchy-Schwarz (CS) divergence. Our method not only uses contrastive learning techniques and block diagonalization constraints to guide representation matrix learning, but also combines representation learning and clustering processes to achieve the interaction of representation and clustering. First, we introduce a novel loss function based on CS divergence in the clustering module to achieve the interaction of representation and clustering. Second, we propose an extension of the multiple positive and negative pair diffusion method to enhance contrastive learning. Finally, we establish the equivalence between contrastive clustering and spectral clustering with orthogonal constraints, leading to a comprehensive model optimization. We evaluate our method on six publicly available datasets and compare its performance with eight competing methods. 
The results demonstrate the superiority of our method over the compared multi-view clustering methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"188-199"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GZOO: Black-Box Node Injection Attack on Graph Neural Networks via Zeroth-Order Optimization","authors":"Hao Yu;Ke Liang;Dayu Hu;Wenxuan Tu;Chuan Ma;Sihang Zhou;Xinwang Liu","doi":"10.1109/TKDE.2024.3483274","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3483274","url":null,"abstract":"The ubiquity of Graph Neural Networks (GNNs) emphasizes the imperative to assess their resilience against node injection attacks, a type of evasion attack that impacts victim models by injecting nodes with fabricated attributes and structures. However, prevailing attacks face two primary limitations: (1) Sequential construction of attributes and structures results in suboptimal outcomes as structure information is overlooked during attribute construction and vice versa. (2) In black-box scenarios, where attackers lack access to victim model architecture and parameters, reliance on surrogate models degrades performance due to architectural discrepancies. To overcome these limitations, we introduce GZOO, a black-box node injection attack that leverages an adversarial graph generator, comprising both attribute and structure sub-generators. This integration crafts optimal attributes and structures by considering their mutual information, enhancing their influence when aggregating information from injected nodes. Furthermore, GZOO proposes a zeroth-order optimization algorithm leveraging prediction results from victim models to estimate gradients for updating generator parameters, eliminating the necessity to train surrogate models. Across sixteen datasets, GZOO significantly outperforms state-of-the-art attacks, achieving remarkable effectiveness and robustness. 
Notably, on the Cora dataset with the GCN model, GZOO achieves an impressive 95.69% success rate, surpassing the maximum 66.01% achieved by baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"319-333"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative Soft Prompt-Tuning for Unsupervised Domain Adaptation","authors":"Yi Zhu;Shuqin Wang;Jipeng Qiang;Xindong Wu","doi":"10.1109/TKDE.2024.3483903","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3483903","url":null,"abstract":"Unsupervised domain adaptation aims to facilitate learning tasks in an unlabeled target domain with knowledge from the related source domain, and it has achieved strong performance with pre-trained language models (PLMs). Recently, inspired by GPT, the prompt-tuning model has been widely explored in stimulating rich knowledge in PLMs for language understanding. However, existing prompt-tuning methods still directly apply the model learned in the source domain to the target domain to minimize the discrepancy between different domains, e.g., the prompts or the template are trained separately to learn embeddings for transferring to the target domain, which is actually the intuition of the end-to-end deep-based approach. In this paper, we propose an Iterative Soft Prompt-Tuning method (ItSPT) for better unsupervised domain adaptation. On the one hand, the prompt-tuning model learned in the source domain is converted into an iterative model to find the true label information in the target domain; the domain adaptation method is then regarded as a few-shot learning task. On the other hand, instead of hand-crafted templates, ItSPT adopts soft prompts to account for both automatic template generation and classification performance. 
Experiments on both English and Chinese datasets demonstrate that our method surpasses the performance of SOTA methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"8580-8592"},"PeriodicalIF":8.9,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is Sharing Neighbor Generator in Federated Graph Learning Safe?","authors":"Liuyi Yao;Zhen Wang;Yuexiang Xie;Yaliang Li;Weirui Kuang;Daoyuan Chen;Bolin Ding","doi":"10.1109/TKDE.2024.3482448","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3482448","url":null,"abstract":"Nowadays, as privacy concerns continue to rise, federated graph learning (FGL) which generalizes the classic federated learning to graph data has attracted increasing attention. However, while the focus has been on designing collaborative learning algorithms, the potential risks of privacy leakage through the sharing of necessary graph-related information in FGL, such as node embeddings and neighbor generators, have been largely neglected. In this paper, we verify the potential risks of privacy leakage in FGL, and provide insights about the cautions in FGL algorithm design. Specifically, we propose a novel privacy attack algorithm named Privacy Attack on federated Graph learning (PAG) towards reconstructing participants’ private node attributes and the linkage relationships. The participant performing the PAG attack is able to reconstruct the node attributes of the victim by matching the received gradients of the generator, and then train a link prediction model based on its local sub-graph to inductively infer the linkages connected to these reconstructed nodes. We theoretically and empirically demonstrate that under PAG attack, directly sharing the neighbor generators makes the FGL vulnerable to the data reconstruction attack. 
Furthermore, an investigation into the key factors that can hinder the success of the PAG attack provides insights into corresponding defense strategies and inspires future research into privacy-preserving FGL.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"8568-8579"},"PeriodicalIF":8.9,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}