Structured Graph-Based Ensemble Clustering
Xuan Zheng, Yihang Lu, Rong Wang, Feiping Nie, Xuelong Li
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3728-3738, 6 March 2025. DOI: 10.1109/TKDE.2025.3546502

Abstract: Ensemble clustering utilizes the complementary information among multiple base clusterings to obtain a clustering model with better performance and greater robustness. Despite its great success, two problems remain in current ensemble clustering methods. First, most methods treat all base clusterings equally. Second, the final ensemble result often relies on k-means or another discretization procedure to uncover the clustering indicators, which yields unsatisfactory results. To address these issues, we propose a novel ensemble clustering method based on structured graph learning, which extracts clustering indicators directly from the learned similarity matrix. Moreover, our methods fully consider the correlation among the base clusterings and can effectively reduce the redundancy among them. Extensive experiments on artificial and real-world datasets demonstrate the efficiency and effectiveness of our methods.
LIOF: Make the Learned Index Learn Faster With Higher Accuracy
Tao Ji, Kai Zhong, Luming Sun, Yiyan Li, Cuiping Li, Hong Chen
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3499-3513, 5 March 2025. DOI: 10.1109/TKDE.2025.3548298

Abstract: Learned indexes, emerging as a promising alternative to traditional indexes such as the B+Tree, use machine learning models to enhance query performance and reduce memory usage. However, their widespread adoption is limited by expensive training costs and the need for highly accurate internal models. Although some studies attempt to optimize the building process of learned indexes, existing methods are restricted in scope and applicability: they are usually tailored to specific index types and rely heavily on pre-trained model knowledge, making deployment challenging. In this work, we introduce the Learned Index Optimization Framework (LIOF), a general and easily integrated solution that expedites training and improves the accuracy of index models for both one-dimensional and multi-dimensional learned indexes. LIOF's optimization is intuitive: it directly provides optimized parameters for index models based on the distribution of node data. By leveraging the correlation between key distribution and node model parameters, LIOF significantly reduces the training epochs required for each node model. Initially, we introduce an optimization strategy inspired by optimization-based meta-learning to train LIOF to generate optimized initial parameters for index node models. Subsequently, we present a data-driven encoder and a parameter-centric decoder network, which adaptively translate the key distribution into a latent representation and decode it into an optimized node model initialization. Additionally, to further exploit characteristics of the key distribution, we propose a monotonic regularizer and a focal loss that guide LIOF training towards efficiency and precision. Through extensive experiments on real-world and synthetic datasets, we demonstrate that LIOF provides substantial improvements in both training efficiency and predictive accuracy for learned indexes.
{"title":"Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations","authors":"Yuyang Ding;Dan Qiao;Juntao Li;Jiajie Xu;Pingfu Chao;Xiaofang Zhou;Min Zhang","doi":"10.1109/TKDE.2025.3567204","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3567204","url":null,"abstract":"Distantly supervised named entity recognition (DS-NER) has emerged as a cheap and convenient alternative to traditional human annotation methods, enabling the automatic generation of training data by aligning text with external resources. Despite the many efforts in noise measurement methods, few works focus on the latent noise distribution between different distant annotation methods. In this work, we explore the effectiveness and robustness of DS-NER by two aspects: (1) distant annotation techniques, which encompasses both traditional rule-based methods and the innovative large language model supervision approach, and (2) noise assessment, for which we introduce a novel framework. This framework addresses the challenges by distinctly categorizing them into the <italic>unlabeled-entity problem (UEP)</i> and the <italic>noisy-entity problem (NEP)</i>, subsequently providing specialized solutions for each. Our proposed method achieves significant improvements on eight real-world distant supervision datasets originating from three different data sources and involving four distinct annotation techniques, confirming its superiority over current state-of-the-art methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4880-4893"},"PeriodicalIF":8.9,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Valid Coverage Oriented Item Perspective Recommendation
Ruijia Ma, Yahong Lian, Rongbo Qi, Chunyao Song, Tingjian Ge
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3810-3823, 4 March 2025. DOI: 10.1109/TKDE.2025.3547968

Abstract: Mainstream recommendation systems have achieved remarkable success in recommending items that align with user interests, but limited attention has been paid to the perspective of item providers. Content providers often desire that all their offerings, including unpopular or cold items, be displayed to and appreciated by users. To tackle the challenges of unfair exhibition and limited item acceptance coverage, we introduce a novel recommendation perspective that enables items to "select" their most relevant users. We further introduce ItemRec, a straightforward plug-and-play approach that leverages the mutual scores calculated by any model, with the goal of maximizing the recommendation and acceptance of items by users. Through extensive experiments on three real-world datasets, we demonstrate that ItemRec can enhance valid coverage by up to 38.5% while maintaining comparable or superior recommendation quality, at the cost of only a minor increase in model inference time (1.5% to 5%). Furthermore, when compared to thirteen state-of-the-art recommendation methods on accuracy, fairness, and diversity, ItemRec exhibits significant advantages: it achieves an optimal balance between precision and valid coverage, with an efficiency gain of 1.8 to 45 times over other fairness-oriented methods.
A Flexible Diffusion Convolution for Graph Neural Networks
Songwei Zhao, Bo Yu, Kang Yang, Sinuo Zhang, Jifeng Hu, Yuan Jiang, Philip S. Yu, Hechang Chen
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3118-3131, 4 March 2025. DOI: 10.1109/TKDE.2025.3547817

Abstract: Graph Neural Networks (GNNs) have been gaining attention for their excellent performance in modeling various graph-structured data. However, most current GNNs consider only fixed-neighbor, discrete message passing, disregarding both the local structure of different nodes and the implicit information between nodes when smoothing features. Previous approaches either focus on adaptively selecting aggregation structures or treat discrete graph convolution as a continuous diffusion process, but none of them considers these issues together, which significantly limits model performance. To this end, we present Flexible Diffusion Convolution (Flexi-DC), which exploits the neighborhood information of each node to set a node-specific continuous diffusion for smoothing features. Specifically, Flexi-DC first extracts local structure knowledge from the node degrees of the graph and then injects it into the diffusion convolution module to smooth features. We additionally use the extracted knowledge to smooth labels. Flexi-DC is an efficient framework that can significantly improve the performance of most GNN architectures. Experimental results on nine graph datasets with different homophily ratios show that Flexi-DC outperforms the vanilla implementations of GCN, JKNet, and ARMA by average accuracy gains of 13.24%, 16.37%, and 11.98%, respectively.
Towards Target Sequential Rules
Wensheng Gan, Gengsen Huang, Jian Weng, Tianlong Gu, Philip S. Yu
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3766-3780, 4 March 2025. DOI: 10.1109/TKDE.2025.3547394

Abstract: In many real-world applications, sequential rule mining (SRM) offers prediction and recommendation capabilities for a variety of services. It is an important pattern mining technique that discovers valuable rules revealing the temporal relationships between objects. Although several SRM algorithms have been proposed to solve various practical problems, none study the problem of targeted mining. Targeted sequential rule mining aims to obtain only those sequential rules a user is interested in, avoiding the generation of invalid and unnecessary rules; this improves the efficiency of rule analysis and reduces the consumption of computing resources. In this paper, we first present the relevant definitions of target sequential rules and formulate the problem of targeted sequential rule mining. We then propose an efficient algorithm called TaSRM, together with several pruning strategies and an optimization that improve its efficiency. Finally, we conduct a large number of experiments on different benchmarks and analyze the results in terms of running time, memory consumption, and scalability, as well as query cases with different query rules. The novel algorithm TaSRM and its variants achieve better experimental performance than the baseline algorithm.
CHASe: Client Heterogeneity-Aware Data Selection for Effective Federated Active Learning
Jun Zhang, Jue Wang, Huan Li, Zhongle Xie, Ke Chen, Lidan Shou
IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 6, pp. 3088-3102, 4 March 2025. DOI: 10.1109/TKDE.2025.3547423

Abstract: Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for (1) tracking EVs by analyzing inference inconsistencies across training epochs, (2) calibrating decision boundaries of inaccurate models with a new alignment loss, and (3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.
{"title":"Dual-Channel Multiplex Graph Neural Networks for Recommendation","authors":"Xiang Li;Chaofan Fu;Zhongying Zhao;Guangjie Zheng;Chao Huang;Yanwei Yu;Junyu Dong","doi":"10.1109/TKDE.2025.3544081","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3544081","url":null,"abstract":"Effective recommender systems play a crucial role in accurately capturing user and item attributes that mirror individual preferences. Some existing recommendation techniques have started to shift their focus towards modeling various types of interactive relations between users and items in real-world recommendation scenarios, such as clicks, marking favorites, and purchases on online shopping platforms. Nevertheless, these approaches still grapple with two significant challenges: (1) Insufficient modeling and exploitation of the impact of various behavior patterns formed by multiplex relations between users and items on representation learning, and (2) ignoring the effect of different relations within behavior patterns on the target relation in recommender system scenarios. In this work, we introduce a novel recommendation framework, <bold><u>D</u></b>ual-<bold><u>C</u></b>hannel <bold><u>M</u></b>ultiplex <bold><u>G</u></b>raph <bold><u>N</u></b>eural <bold><u>N</u></b>etwork (DCMGNN), which addresses the aforementioned challenges. It incorporates an explicit behavior pattern representation learner to capture the behavior patterns composed of multiplex user-item interactive relations, and includes a relation chain representation learner and a relation chain-aware encoder to discover the impact of various auxiliary relations on the target relation, the dependencies between different relations, and mine the appropriate order of relations in a behavior pattern. Extensive experiments on three real-world datasets demonstrate that our DCMGNN surpasses various state-of-the-art recommendation methods. It outperforms the best baselines by 10.06% and 12.15% on average across all datasets in terms of Recall@10 and NDCG@10 respectively. The source code of our paper is available at <uri>https://github.com/lx970414/TKDE-DCMGNN</uri>.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3327-3341"},"PeriodicalIF":8.9,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-Accelerated Structural Diversity Search in Graphs","authors":"Jinbin Huang;Xin Huang;Jianliang Xu;Byron Choi;Yun Peng","doi":"10.1109/TKDE.2025.3547443","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3547443","url":null,"abstract":"The problem of structural diversity search has been widely studied recently, which aims to find out the users with the highest structural diversity in social networks. The structural diversity of a user is depicted by the number of social contexts inside his/her contact neighborhood. Three structural diversity models based on cohesive subgraph models (e.g., k-sized component, k-core, and k-truss), have been proposed. Previous solutions only focus on CPU-based sequential solutions, suffering from several key steps of that cannot be highly parallelized. GPUs enjoy high-efficiency performance in parallel computing for solving many complex graph problems such as triangle counting, subgraph pattern matching, and graph decomposition. In this paper, we provide a unified framework to utilize multiple GPUs to accelerate the computation of structural diversity search under the mentioned three structural diversity models. We first propose a GPU-based lock-free method to efficiently extract ego-networks in CSR format in parallel. Second, we design detailed GPU-based solutions for computing <italic>k</i>-sized component-based, <italic>k</i>-core-based, and also <italic>k</i>-truss-based structural diversity scores by dynamically grouping GPU resources. To effectively optimize the workload balance among multiple GPUs, we propose a greedy work-packing scheme and a dynamic work-stealing strategy to fulfill usage. Extensive experiments on real-world datasets validate the superiority of our GPU-based structural diversity search solutions in terms of efficiency and effectiveness.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3413-3428"},"PeriodicalIF":8.9,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Summary Graph Induced Invariant Learning for Generalizable Graph Learning","authors":"Xuecheng Ning;Yujie Wang;Kui Yu;Jiali Miao;Fuyuan Cao;Jiye Liang","doi":"10.1109/TKDE.2025.3547226","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3547226","url":null,"abstract":"As a promising strategy to achieve generalizable graph learning tasks, graph invariant learning emphasizes identifying invariant subgraphs for stable predictions on biased unknown distribution by selecting the important edges/nodes based on their contributions to the predictive tasks (i.e., subgraph predictivity). However, the existing approaches solely relying on subgraph predictivity face a challenge: the learned invariant subgraph often contains numerous spurious nodes and shows poor connectivity, undermining the generalization power of Graph Neural Networks (GNNs). To tackle this issue, we propose a summary graph-induced Invariant Learning (SIL) model that innovatively adopts a summary graph to leverage both the subgraph connectivity and predictivity for learning strong connected and accurate invariant subgraphs. Specifically, SIL first learns a summary graph containing multiple strongly connected supernodes while maintaining structure consistency with the original graph. Second, the learned summary graph is disentangled into an invariant supernode and spurious counterparts to eliminate the interference of highly predictive edges and nodes. Finally, SIL identifies a potential invariant subgraph from the invariant supernode to accomplish generalization tasks. Additionally, we provide a theoretical analysis of the summary graph learning mechanism, guaranteeing that the learned summary graph is consistent with the original graph. Experimental results validate the effectiveness of the SIL model.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3739-3752"},"PeriodicalIF":8.9,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}