{"title":"Revisiting Edge Perturbation for Graph Neural Network in Graph Data Augmentation and Attack","authors":"Xin Liu;Yuxiang Zhang;Meng Wu;Mingyu Yan;Kun He;Wei Yan;Shirui Pan;Xiaochun Ye;Dongrui Fan","doi":"10.1109/TKDE.2025.3565306","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3565306","url":null,"abstract":"Edge perturbation is a basic method for modifying graph structures. Based on its effect on the performance of graph neural networks (GNNs), it can be categorized into two veins: graph data augmentation and attack. Surprisingly, both veins of edge perturbation methods employ the same operations, yet yield opposite effects on GNNs’ accuracy. A distinct boundary between the two uses of edge perturbation has never been clearly defined. Consequently, inappropriate perturbations may lead to undesirable outcomes, necessitating precise adjustments to achieve desired effects. Therefore, the questions of “why does edge perturbation have a two-faced effect?” and “what makes edge perturbation flexible and effective?” remain unanswered. In this paper, we answer these questions by proposing a unified formulation and establishing a quantizable boundary between the two categories of edge perturbation methods. Specifically, we conduct experiments to elucidate the differences and similarities between these methods and theoretically unify their workflow by casting it as one optimization problem. Then, we devise the Edge Priority Detector (EPD) to generate a novel priority metric that bridges these methods within the unified workflow. 
Experiments show that EPD can perform augmentation or attack flexibly and achieves performance comparable or superior to its counterparts with lower time overhead.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4225-4238"},"PeriodicalIF":8.9,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
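As a rough illustration of the unified workflow this abstract describes, the sketch below scores candidate edge flips with a priority metric and applies the top-k flips. The degree-product priority and the `mode` rule separating augmentation from attack are hypothetical stand-ins: EPD's learned metric and the paper's quantizable boundary are not reproduced here.

```python
import itertools

def perturb_edges(edges, num_nodes, k, mode="augment"):
    """Flip the k highest-priority edges/non-edges of an undirected graph.

    The priority (degree product) is a toy stand-in for EPD's metric.
    In this sketch, `mode` only changes which flips are considered:
    augmentation deletes existing edges, attack inserts new ones.
    """
    edges = {frozenset(e) for e in edges}
    deg = [0] * num_nodes
    for e in edges:
        for v in e:
            deg[v] += 1
    cands = []
    for u, v in itertools.combinations(range(num_nodes), 2):
        present = frozenset((u, v)) in edges
        # toy boundary between the two veins of edge perturbation
        if present == (mode == "augment"):
            cands.append((deg[u] * deg[v], u, v))
    cands.sort(reverse=True)  # highest priority first
    out = set(edges)
    for _, u, v in cands[:k]:
        out.symmetric_difference_update({frozenset((u, v))})
    return out
```

With the same input graph, the two modes produce opposite edits (one fewer vs. one more edge), mirroring the "same operations, opposite effects" observation.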
{"title":"Domain Adaptation via Learning Using Statistical Invariant","authors":"Chunna Li;Yiwei Song;Yuan-Hai Shao","doi":"10.1109/TKDE.2025.3565780","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3565780","url":null,"abstract":"Domain adaptation has found widespread applications in real-life scenarios, especially when the target domain has limited labeled samples. However, most domain adaptation models utilize only one type of knowledge from the source domain, which is usually achieved by a strong mode of convergence. To fully incorporate multiple types of knowledge from the source domain, for binary classification, this paper studies a novel learning paradigm for Domain Adaptation via Learning Using Statistical Invariant (DLUSI) by simultaneously combining the strong and weak modes of convergence in a Hilbert space. The strong mode of convergence undertakes the mission of learning a least squares probability output binary classification task in a general hypothesis space, while the weak mode of convergence integrates diverse knowledge by constructing meaningful statistical invariants that embody the concept of intelligence. The utilization of weak convergence shrinks the admissible set of approximation functions, and subsequently accelerates the learning process. In this paper, several statistical invariants that represent sample, feature and parameter information from the source domain are constructed. By taking an appropriate statistical invariant, DLUSI recovers some existing methods as special cases. 
Experimental results on synthetic data as well as the widely used Amazon Reviews and 20 News data demonstrate the superiority of the proposed method.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4023-4034"},"PeriodicalIF":8.9,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STDA: Spatio-Temporal Deviation Alignment Learning for Cross-City Fine-Grained Urban Flow Inference","authors":"Min Yang;Xiaoyu Li;Bin Xu;Xiushan Nie;Muming Zhao;Chengqi Zhang;Yu Zheng;Yongshun Gong","doi":"10.1109/TKDE.2025.3565504","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3565504","url":null,"abstract":"Fine-grained urban flow inference (FUFI) is crucial for traffic management, as it infers high-resolution urban flow maps from coarse-grained observations. Existing FUFI methods typically focus on a single city and rely on comprehensive training with large-scale datasets to achieve precise inferences. However, data availability in developing cities may be limited, posing challenges to the development of well-performing models. To address this issue, we propose cross-city fine-grained urban flow inference, which aims to transfer spatio-temporal knowledge from data-rich cities to data-scarce areas using meta-transfer learning. This paper devises a <bold>S</bold>patio-<bold>T</bold>emporal <bold>D</bold>eviation <bold>A</bold>lignment (STDA) framework to mitigate spatio-temporal distribution deviations and urban structural deviations between multiple source cities and the target city. Furthermore, STDA presents a cross-city normalization method that adaptively combines batch and instance normalization to maintain consistency between city-variant and city-invariant features. Besides, we design an urban structure alignment module to align spatial topological differences across cities. STDA effectively reduces distribution and structural deviations among different datasets while avoiding negative transfer. 
Extensive experiments conducted on three real-world datasets demonstrate that STDA consistently outperforms state-of-the-art baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4833-4845"},"PeriodicalIF":8.9,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144573002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
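The cross-city normalization idea above — adaptively combining batch and instance normalization — can be illustrated with a minimal sketch. The fixed mixing weight `rho` and the IBN-style blend are assumptions for illustration; STDA's actual layer learns this combination rather than fixing it.

```python
import numpy as np

def cross_city_norm(x, rho=0.5, eps=1e-5):
    """Blend batch and instance normalization of an (N, C, H, W) flow map.

    Batch norm shares statistics across the batch (city-invariant
    features); instance norm uses per-sample statistics (city-variant
    features). `rho` would be learnable in the real model.
    """
    bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / np.sqrt(
        x.var(axis=(0, 2, 3), keepdims=True) + eps)
    inn = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(
        x.var(axis=(2, 3), keepdims=True) + eps)
    return rho * bn + (1 - rho) * inn
```

Setting `rho=1.0` recovers pure batch normalization (zero mean per channel across the batch); `rho=0.0` recovers pure instance normalization.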
{"title":"TELEX: Two-Level Learned Index for Rich Queries on Enclave-Based Blockchain Systems","authors":"Haotian Wu;Yuzhe Tang;Zhaoyan Shen;Jun Tao;Chenhao Lin;Zhe Peng","doi":"10.1109/TKDE.2025.3564905","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3564905","url":null,"abstract":"Blockchain has become a popular paradigm for secure and immutable data storage. Despite its numerous applications across various fields, concerns regarding the user privacy and result integrity during data queries persist. Additionally, the need for rich query functionalities to harness the full potential of blockchain data remains an area ripe for exploration. In order to address these challenges, our paper first utilizes a framework based on the Trusted Execution Environment (TEE) and oblivious RAM technique to achieve both privacy and data integrity. To enhance the query efficiency over the entire blockchain, we then devise a two-level learned indexing methodology named TELEX within the TEE for both integer and string keys. We also propose different query processing algorithms for versatile query types, including exact queries, aggregate queries, Boolean queries, and range queries. By implementing the prototype and conducting extensive evaluation, we demonstrate the feasibility and remarkable improvement in efficiency compared to existing solutions.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4299-4313"},"PeriodicalIF":8.9,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Turn Waste Into Wealth: On Efficient Clustering and Cleaning Over Dirty Data","authors":"Kenny Ye Liang;Yunxiang Su;Shaoxu Song;Chunping Li","doi":"10.1109/TKDE.2025.3564313","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3564313","url":null,"abstract":"Dirty data commonly exist. Simply discarding a large number of inaccurate points (as noise) could greatly affect clustering results. We argue that dirty data can be repaired and utilized as strong support in clustering. To this end, we study a novel problem of simultaneously clustering and repairing dirty data. Referring to the minimum change principle in data repairing, the objective is to find a minimum modification of inaccurate points such that the large amount of dirty data can enhance clustering. We show that the problem is <sc>np</sc>-hard and can be formulated as an integer linear programming (<sc>ilp</sc>) problem. A grid-based constant-factor approximation algorithm, <sc>gdorc</sc>, is devised with high efficiency. In experiments, <sc>gdorc</sc> achieves strong repairing and clustering results with low time consumption. Empirical results demonstrate that <italic>both the clustering and cleaning accuracies</italic> can be improved by our approach of repairing and utilizing the dirty data in clustering.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4361-4372"},"PeriodicalIF":8.9,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
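The grid-based repair-and-cluster idea behind GDORC can be illustrated in miniature: dense grid cells act as cluster cores, and points in sparse cells are minimally moved ("repaired") to the centre of the nearest dense cell. The cell size, density threshold, and snap-to-centre rule below are illustrative simplifications, not the paper's algorithm.

```python
from collections import defaultdict

def grid_repair_cluster(points, cell=1.0, min_pts=2):
    """Toy grid-based joint repair and clustering of 2-D points.

    Cells holding >= min_pts points are dense and keep their points;
    points in sparse cells are repaired to the nearest dense cell's
    centre (minimum-change spirit). Returns (repaired_points, labels),
    one cluster label per dense cell.
    """
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(i)
    dense = {c: k for k, c in enumerate(
        sorted(c for c, idx in grid.items() if len(idx) >= min_pts))}
    centers = {c: ((c[0] + 0.5) * cell, (c[1] + 0.5) * cell) for c in dense}
    repaired, labels = [list(p) for p in points], [None] * len(points)
    for c, idx in grid.items():
        if c in dense:
            for i in idx:
                labels[i] = dense[c]
        else:
            for i in idx:
                # repair: snap to the nearest dense cell centre
                near = min(dense, key=lambda d:
                           (centers[d][0] - points[i][0]) ** 2 +
                           (centers[d][1] - points[i][1]) ** 2)
                repaired[i] = list(centers[near])
                labels[i] = dense[near]
    return repaired, labels
```

Rather than being discarded as noise, the outlying point is pulled into the nearest cluster, which is how the dirty data "supports" the clustering result.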
{"title":"A Scalable Algorithm for Fair Influence Maximization With Unbiased Estimator","authors":"Xiaobin Rui;Zhixiao Wang;Hao Peng;Wei Chen;Philip S. Yu","doi":"10.1109/TKDE.2025.3564283","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3564283","url":null,"abstract":"This paper studies the fair influence maximization problem with efficient algorithms. In particular, given a graph <inline-formula><tex-math>$G$</tex-math></inline-formula>, a community structure <inline-formula><tex-math>${\\mathcal {C}}$</tex-math></inline-formula> consisting of disjoint communities, and a budget <inline-formula><tex-math>$k$</tex-math></inline-formula>, the problem asks to select a seed set <inline-formula><tex-math>$S$</tex-math></inline-formula> (<inline-formula><tex-math>$|S|=k$</tex-math></inline-formula>) that maximizes the influence spread while narrowing the influence gap between different communities. This problem derives from some significant social scenarios, such as health interventions (e.g. suicide/HIV prevention) where individuals from underrepresented groups or LGBTQ communities may be disproportionately excluded from the benefits of the intervention. To depict the concept of fairness in the context of influence maximization, researchers have proposed various notions of fairness, among which the welfare fairness notion, which better balances fairness level and influence spread, has shown promising effectiveness. However, the lack of efficient algorithms for optimizing the objective function under welfare fairness restricts its application to networks of only a few hundred nodes. In this paper, we modify the objective function of welfare fairness to maximize the exponentially weighted sum and the logarithmically weighted sum over all communities’ influenced fractions (utility). 
To achieve efficient algorithms with theoretical guarantees, we first introduce two unbiased estimators: one for the fractional power of the arithmetic mean and the other for the logarithm of the arithmetic mean. Then, by adapting the Reverse Influence Sampling (RIS) approach, we convert the optimization problem to a weighted maximum coverage problem. We also analyze the number of reverse reachable sets needed to approximate the fair influence with high probability. Finally, we present an efficient algorithm that guarantees <inline-formula><tex-math>$1-1/e-\\varepsilon$</tex-math></inline-formula> (positive objective function) or <inline-formula><tex-math>$1+1/e+\\varepsilon$</tex-math></inline-formula> (negative objective function) approximation for any small <inline-formula><tex-math>$\\varepsilon > 0$</tex-math></inline-formula>. Experiments demonstrate that our proposed algorithm can efficiently handle large-scale networks with good performance.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"3881-3895"},"PeriodicalIF":8.9,"publicationDate":"2025-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
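The RIS reduction described above turns seed selection into weighted maximum coverage, which admits the classic greedy 1-1/e approximation. The sketch below is a generic greedy cover over reverse-reachable (RR) sets with hypothetical weights; the paper's estimator-based weighting and sample-size analysis are not reproduced here.

```python
def greedy_weighted_coverage(rr_sets, weights, k):
    """Greedy (1-1/e)-approximate weighted maximum coverage.

    rr_sets: list of reverse-reachable node sets; weights[i] is the
    reward for covering rr_sets[i] (in fair IM this would come from
    the community utility estimators). Returns a seed set, |S| <= k.
    """
    covered = [False] * len(rr_sets)
    seeds = set()
    nodes = set().union(*rr_sets) if rr_sets else set()
    for _ in range(k):
        best, gain = None, 0.0
        for v in nodes - seeds:
            # marginal gain: total weight of uncovered sets hit by v
            g = sum(w for s, w, c in zip(rr_sets, weights, covered)
                    if not c and v in s)
            if g > gain:
                best, gain = v, g
        if best is None:  # nothing left to cover
            break
        seeds.add(best)
        for i, s in enumerate(rr_sets):
            if best in s:
                covered[i] = True
    return seeds
```

A lazy-evaluation priority queue would replace the inner scan at scale; the quadratic loop here is for clarity only.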
{"title":"Data Synthesis Reinvented: Preserving Missing Patterns for Enhanced Analysis","authors":"Xinyue Wang;Hafiz Asif;Shashank Gupta;Jaideep Vaidya","doi":"10.1109/TKDE.2025.3563319","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3563319","url":null,"abstract":"Synthetic data is being widely used as a replacement or enhancement for real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions; consequently, retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. 
Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving missing data distribution in synthetic data.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"3962-3975"},"PeriodicalIF":8.9,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
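A minimal way to preserve missing patterns during synthesis, per the idea above, is to model each column's missing rate alongside its observed value distribution. The independent-per-column sampler below is a deliberately simple sketch; the paper's methods model the joint observed/missing distribution and integrate with full generative models.

```python
import random

def synthesize_with_missingness(rows, n, seed=0):
    """Generate n synthetic rows preserving, per column, both the
    empirical distribution of observed values and the missing rate
    (None marks a missing value).

    Columns are treated independently here for simplicity; real
    pattern-preserving synthesis would model them jointly.
    """
    rng = random.Random(seed)
    out_cols = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        p_miss = 1 - len(observed) / len(col)
        out_cols.append([
            None if (rng.random() < p_miss or not observed)
            else rng.choice(observed)
            for _ in range(n)
        ])
    return [list(r) for r in zip(*out_cols)]
```

Standard pipelines would impute or drop the `None`s before fitting a generator; keeping them lets downstream analyses of missingness (e.g. non-response behavior) still work on the synthetic data.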
{"title":"A Local Community Detection Method Based on Folded Subgraph","authors":"Mengting Zhang;Weihong Bi","doi":"10.1109/TKDE.2025.3563100","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3563100","url":null,"abstract":"Community structure refers to the “small groups” in a network, and detecting it has significant application value. As networks continue to grow in scale and complexity, global information about the network is often difficult to obtain. Moreover, in many cases we care more about the local community in which a given node is located. Local community detection methods detect such structure using only local information around a given node; however, many of them suffer from limited precision. To alleviate this problem, we propose the FG-based method in this paper. Exploiting the characteristics of complex networks, a folded-subgraph method is designed that treats groups of similar nodes as single nodes, reducing the impact of noise in the network. Furthermore, based on the folded subgraph, the FG-based method designs a three-stage local expansion strategy in which nodes with different characteristics are added to the local community at each stage. 
Experiments show that the FG-based method improves both the recall and the precision of detected local community structures.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"3869-3880"},"PeriodicalIF":8.9,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Labeling and Self-Knowledge Distillation Unsupervised Feature Selection","authors":"Yunzhi Ling;Feiping Nie;Weizhong Yu;Xuelong Li","doi":"10.1109/TKDE.2025.3561046","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3561046","url":null,"abstract":"This paper proposes a deep pseudo-label method for unsupervised feature selection, which learns non-linear representations to generate pseudo-labels and trains a Neural Network (NN) to select informative features via self-Knowledge Distillation (KD). Specifically, the proposed method divides a standard NN into two sub-components: an encoder and a predictor, and introduces a dependency subnet. It first pre-trains the encoder in a self-supervised manner to produce informative representations and then alternates between two steps: (1) learning pseudo-labels by combining the clustering results of the encoder's outputs with the NN's prediction outputs, and (2) updating the NN's parameters by globally selecting a subset of features to predict the pseudo-labels while updating the subnet's parameters through self-KD. Self-KD is achieved by encouraging the subnet to locally capture a subset of the NN features to produce class probabilities that match those produced by the NN. This allows the model to self-absorb the learned inter-class knowledge and evaluate feature diversity, removing redundant features without sacrificing performance. Meanwhile, the potential discriminative capability of an NN can also be self-excavated without the assistance of other NNs. The two alternating steps reinforce each other: in step (2), by predicting the learned pseudo-labels and conducting self-KD, the discrimination of the outputs of both the NN and the encoder is gradually enhanced, while the self-labeling method in step (1) leverages these two improvements to further refine the pseudo-labels for step (2), resulting in superior performance. 
Extensive experiments show the proposed method significantly outperforms state-of-the-art methods across various datasets.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4270-4284"},"PeriodicalIF":8.9,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
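The self-KD step above — encouraging the subnet's class probabilities to match those of the full NN — is typically implemented as a temperature-softened KL term. The function below is a generic sketch of such a distillation loss, not the paper's exact objective; the temperature value is an assumption.

```python
import math

def self_kd_loss(subnet_logits, nn_logits, tau=2.0):
    """KL(full NN || subnet) on temperature-softened class logits.

    A generic self-distillation term: the subnet plays student to the
    full NN's teacher within the same model. tau > 1 softens both
    distributions so inter-class similarity knowledge is transferred.
    """
    def soften(logits):
        m = max(l / tau for l in logits)          # for numerical stability
        exps = [math.exp(l / tau - m) for l in logits]
        z = sum(exps)
        return [e / z for e in exps]
    q = soften(subnet_logits)   # student distribution
    p = soften(nn_logits)       # teacher distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the two heads agree exactly and strictly positive otherwise, so minimizing it pulls the subnet's predictions toward the full network's.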
{"title":"Pseudo-Label Guided Bidirectional Discriminative Deep Multi-View Subspace Clustering","authors":"Yongbo Yu;Zhoumin Lu;Feiping Nie;Weizhong Yu;Zongcheng Miao;Xuelong Li","doi":"10.1109/TKDE.2025.3562723","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3562723","url":null,"abstract":"In practical applications, multi-view subspace clustering is hindered by data noise that disrupts the ideal block-diagonal structure of self-representation matrices, thereby degrading performance. Moreover, many existing methods rely solely on sample features, overlooking the valuable structural information in affinity matrices (e.g., pairwise relationships). Meanwhile, conventional contrastive learning strategies often introduce false-negative pairs due to noise and unreliable sample selection. To address these challenges, we propose a pseudo-label guided bidirectional discriminative deep multi-view subspace clustering method (PBDMSC). Our approach first employs pseudo-label guided contrastive learning, using previous cluster assignments to select reliable positive and negative samples, which mitigates incorrect pairings and enhances low-dimensional representations. Then, a discriminative self-representation learning method is introduced that leverages pseudo-labels to enforce homogeneous expression constraints and incorporates a bidirectional attention mechanism to preserve the structured information from affinity matrices, thereby enhancing robustness. 
Experimental results on six real-world datasets demonstrate that our proposed method achieves state-of-the-art clustering performance.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4213-4224"},"PeriodicalIF":8.9,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144232123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}