{"title":"Improving Sequential Recommendations via Bidirectional Temporal Data Augmentation With Pre-Training","authors":"Juyong Jiang;Peiyan Zhang;Yingtao Luo;Chaozhuo Li;Jae Boum Kim;Kai Zhang;Senzhang Wang;Sunghun Kim;Philip S. Yu","doi":"10.1109/TKDE.2025.3546035","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3546035","url":null,"abstract":"Sequential recommendation systems are integral to discerning temporal user preferences. Yet, the task of learning from abbreviated user interaction sequences poses a notable challenge. Data augmentation has been identified as a potent strategy to enhance the informational richness of these sequences. Traditional augmentation techniques, such as item randomization, may disrupt the inherent temporal dynamics. Although recent advancements in reverse chronological pseudo-item generation have shown promise, they can introduce temporal discrepancies when assessed in a natural chronological context. In response, we introduce a sophisticated approach, Bidirectional temporal data Augmentation with pre-training (BARec). Our approach leverages bidirectional temporal augmentation and knowledge-enhanced fine-tuning to synthesize authentic pseudo-prior items that <italic>retain user preferences and capture deeper item semantic correlations</italic>, thus boosting the model’s expressive power. Our comprehensive experimental analysis on five benchmark datasets confirms the superiority of BARec across both short and elongated sequence contexts. 
Moreover, theoretical examination and case study offer further insight into the model’s logical processes and interpretability.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2652-2664"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Next-POI Recommendation via Spatial-Temporal Knowledge Graph Contrastive Learning and Trajectory Prompt","authors":"Wei Chen;Haoyu Huang;Zhiyu Zhang;Tianyi Wang;Youfang Lin;Liang Chang;Huaiyu Wan","doi":"10.1109/TKDE.2025.3545958","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3545958","url":null,"abstract":"Next POI (Point-of-Interest) recommendation aims to forecast users’ future movements based on their historical check-in trajectories, holding significant value in location-based services. Existing methods address trajectory data sparsity by integrating rich auxiliary information or using spatial-temporal knowledge graphs (STKGs), showing promising results. Yet, they face two main challenges: i) Due to the difficulty of transforming structured trajectory data into trajectory text describing users’ spatial-temporal mobility, the powerful reasoning ability of pre-trained language models is rarely explored to enhance recommendation performance. ii) Methods based on STKG can introduce external knowledge inconsistent with user preferences, leading to the knowledge noise generated hampering the accuracy of recommendations. To this end, we propose a novel approach called STKG-PLM that integrates <underline>STKG</underline> contrastive learning and <underline>p</underline>rompt pre-trained <underline>l</underline>anguage <underline>m</underline>odel (PLM) to enhance the next POI recommendation. Specifically, we design a spatial-temporal trajectory prompt template that transforms structured trajectories into text corpus based on STKG, serving as the input of PLM to understand the movement pattern of users from coarse-grained and fine-grained perspectives. Additionally, we propose an STKG contrastive learning framework to mitigate the introduced knowledge noise. 
Extensive experiments on three real-world datasets demonstrate that STKG-PLM exhibits notable performance improvements over the state-of-the-art baseline methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3570-3582"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Multi-Label Expected Accuracy Using Labelset Distributions","authors":"Laurence A. F. Park;Jesse Read","doi":"10.1109/TKDE.2025.3545972","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3545972","url":null,"abstract":"A multi-label classifier estimates the binary label state (relevant/irrelevant) for each of a set of concept labels, for a given instance. Probabilistic multi-label classifiers provide a distribution over all possible labelset combinations of such label states (the powerset of labels), from which we can provide the best estimate by selecting the labelset corresponding to the largest expected accuracy. Providing confidence for predictions is important for real-world application of multi-label models, which provides the practitioner with a sense of the correctness of the prediction. It has been thought that the probability of the chosen labelset is a good measure of the confidence of the prediction, but multi-label accuracy can be measured in many ways and so confidence should align with the expected accuracy of the evaluation method. In this article, we investigate the effectiveness of seven candidate functions for estimating multi-label expected accuracy conditioned on the labelset distribution and the evaluation method. We found most correlate to expected accuracy and have varying levels of robustness. 
Further, we found that the candidate functions provide high expected accuracy estimates for Hamming similarity, but a combination of the candidates provided an accurate estimate of expected accuracy for Jaccard index and Exact match.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2513-2524"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Doing More With Less: A Survey of Data Selection Methods for Mathematical Modeling","authors":"Nicolai A. Weinreich;Arman Oshnoei;Remus Teodorescu;Kim G. Larsen","doi":"10.1109/TKDE.2025.3545965","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3545965","url":null,"abstract":"Big data applications such as Artificial Intelligence (AI) and Internet of Things (IoT) have in recent years been leading to many technological breakthroughs in system modeling. However, these applications are typically data intensive, thus requiring an increasing cost of resources. In this paper, a first-of-its-kind comprehensive review of data selection methods across different engineering disciplines is given in order to analyze the effectiveness of these methods in improving the data efficiency of mathematical modeling algorithms. Eight distinct selection methods have been identified and subsequently analyzed and discussed on the basis of the relevant literature. In addition, the selection methods have been classified according to three dichotomies established by the survey. A comparative analysis of these methods was conducted along with a discussion of potentials, challenges, and future research directions for the research area. Data selection was found to be widely used in many engineering applications and has the potential to play an important role in making more sustainable Big Data applications, especially those in which transmission of data across large distances is required. 
Furthermore, making resource-aware decisions about the use of data has been shown to be highly effective in reducing energy costs while ensuring high performance of the model.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2420-2439"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10904270","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient and Accurate Spatial Queries Using Lossy Compressed 3D Geometry Data","authors":"Dejun Teng;Zhaochuan Li;Zhaohui Peng;Shuai Ma;Fusheng Wang","doi":"10.1109/TKDE.2025.3539729","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3539729","url":null,"abstract":"3D spatial data management is increasingly vital across various application scenarios, such as GIS, digital twins, human atlases, and tissue imaging. However, the inherent complexity of 3D spatial data, primarily represented by 3D geometries in real-world applications, hinders the efficient evaluation of spatial relationships through resource-intensive geometric computations. Geometric simplification algorithms have been developed to reduce the complexity of 3D representations, albeit at the cost of querying accuracy. Previous work has aimed to address precision loss by leveraging the spatial relationship between the simplified and original 3D object representations. However, this approach relied on specialized geometric simplification algorithms tailored to regions with specific criteria. In this paper, we introduce a novel approach to achieve highly efficient and accurate 3D spatial queries, incorporating geometric computation and simplification. We present a generalized progressive refinement methodology applicable to general geometric simplification algorithms, involving accurate querying of 3D geometry data using low-resolution representations and simplification extents quantified using Hausdorff distances at the facet level. Additionally, we propose techniques for calculating and storing Hausdorff distances efficiently. 
Extensive experimental evaluations validate the effectiveness of the proposed method which outperforms state-of-the-art systems by a factor of 4 while minimizing computational and storage overhead.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2472-2487"},"PeriodicalIF":8.9,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Expandable Borderline Smote Over-Sampling Method for Class Imbalance Problem","authors":"Hao Sun;Jianping Li;Xiaoqian Zhu","doi":"10.1109/TKDE.2025.3544284","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3544284","url":null,"abstract":"The class imbalance problem can cause classifiers to be biased toward the majority class and inclined to generate incorrect predictions. While existing studies have proposed numerous oversampling methods to alleviate class imbalance by generating extra minority class samples, these methods still have some inherent weaknesses and make the generated samples less informative. This study proposes a novel over-sampling method named the Expandable Borderline Smote (EB-Smote), which can address the weaknesses of existing over-sampling methods and generate more informative synthetic samples. In EB-Smote, not only minority class but also majority class is oversampled, and the synthetic samples are generated in the area between the selected minority and majority samples, which are close to the borderlines of their respective classes. EB-Smote can generate more informative samples by expanding the borderlines of minority and majority classes toward the actual decision boundary. Based on 27 imbalanced datasets and commonly used machine learning models, the experimental results demonstrate that EB-Smote significantly outperforms the other 8 existing oversampling methods. 
This study can provide theoretical guidance and practical recommendations to solve the crucial class imbalance problem in classification tasks.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2183-2199"},"PeriodicalIF":8.9,"publicationDate":"2025-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Amortized O(1) Lower Bound for Dynamic Time Warping in Motif Discovery","authors":"Zemin Chao;Hong Gao;Dongjing Miao;Jianzhong Li;Hongzhi Wang","doi":"10.1109/TKDE.2025.3544751","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3544751","url":null,"abstract":"Motif discovery is a critical operation for analyzing series data in many applications. Recent works demonstrate the importance of finding motifs with Dynamic Time Warping. However, existing algorithms spend most of their time computing lower bounds of Dynamic Time Warping to filter out the unpromising candidates. Specifically, the time complexity for computing these lower bounds is <inline-formula><tex-math>$O(L)$</tex-math></inline-formula> for each pair of subsequences, where <inline-formula><tex-math>$L$</tex-math></inline-formula> is the length of the motif (subsequences). This paper proposes two new lower bounds, called <inline-formula><tex-math>$LB_{f}$</tex-math></inline-formula> and <inline-formula><tex-math>$LB_{M}$</tex-math></inline-formula>, both of which cost only amortized <inline-formula><tex-math>$O(1)$</tex-math></inline-formula> time for each pair of subsequences. On real datasets, the proposed lower bounds are at least one order of magnitude faster than the state-of-the-art lower bounds used in motif discovery while still maintaining satisfactory effectiveness. Based on these faster lower bounds, this paper designs an efficient motif discovery algorithm that significantly reduces the cost of lower bounds. 
The experiments conducted on real datasets show the proposed algorithm is 5.6 times faster than the state-of-the-art algorithms on average.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2239-2252"},"PeriodicalIF":8.9,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Provenance Graph Kernel","authors":"David Kohan Marzagão;Trung Dong Huynh;Ayah Helal;Sean Baccas;Luc Moreau","doi":"10.1109/TKDE.2025.3543097","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3543097","url":null,"abstract":"Provenance is a standardised record that describes how entities, activities, and agents have influenced a piece of data; it is commonly represented as graphs with relevant labels on both their nodes and edges. With the growing adoption of provenance in a wide range of application domains, users are increasingly confronted with an abundance of graph data, which may prove challenging to process. Graph kernels, on the other hand, have been successfully used to efficiently analyse graphs. In this paper, we introduce a novel graph kernel called <italic>provenance kernel</italic>, which is inspired by and tailored for provenance data. We employ provenance kernels to classify provenance graphs from three application domains. Our evaluation shows that they perform well in terms of classification accuracy and yield competitive results when compared against existing graph kernel methods and the provenance network analytics method while more efficient in computing time. Moreover, the provenance types used by provenance kernels are a symbolic representation of a tree pattern which can, in turn, be described using the domain-agnostic vocabulary of provenance. 
Provenance types therefore allow for the creation of explanations of predictive models built on them.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3653-3668"},"PeriodicalIF":8.9,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns","authors":"Yuxiang Guo;Yuren Mao;Zhonghao Hu;Lu Chen;Yunjun Gao","doi":"10.1109/TKDE.2025.3545176","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3545176","url":null,"abstract":"Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer poor effectiveness due to the issues occurring in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, <inline-formula><tex-math>${sf Snoopy}$</tex-math></inline-formula>, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. 
Extensive experiments on four real-world datasets demonstrate that <inline-formula><tex-math>${\\sf Snoopy}$</tex-math></inline-formula> outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency, being at least 5 orders of magnitude faster than cell-level solutions, and 3.5× faster than existing column-level methods.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2971-2985"},"PeriodicalIF":8.9,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGoFed: Constrained Gradient Optimization Strategy for Federated Class Incremental Learning","authors":"Jiyuan Feng;Xu Yang;Liwen Liang;Weihong Han;Binxing Fang;Qing Liao","doi":"10.1109/TKDE.2025.3544605","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3544605","url":null,"abstract":"Federated Class Incremental Learning (FCIL) has emerged as a new paradigm due to its applicability in real-world scenarios. In FCIL, clients continuously generate new data with unseen class labels and do not share local data due to privacy restrictions, and each client’s class distribution evolves dynamically and independently. However, existing work still faces two significant challenges. Firstly, current methods lack a better balance between maintaining sound anti-forgetting effects over old data (stability) and ensuring good adaptability for new tasks (plasticity). Secondly, some FCIL methods overlook that the incremental data will also have a non-identical label distribution, leading to poor performance. This paper proposes CGoFed, which includes relax-constrained gradient update and cross-task gradient regularization modules. The relax-constrained gradient update prevents forgetting the knowledge about old data while quickly adapting to the new data by constraining the gradient update direction to a gradient space that minimizes interference with historical tasks. The cross-task gradient regularization also finds applicable historical models from other clients and trains a personalized global model to address the non-identical label distribution problem. 
The results demonstrate that CGoFed performs well in alleviating catastrophic forgetting and improves model performance by 8%-23% compared with the SOTA comparison method.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 5","pages":"2282-2295"},"PeriodicalIF":8.9,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}