{"title":"L-ASCRA: A Linearithmic Time Approximate Spectral Clustering Algorithm Using Topologically-Preserved Representatives","authors":"Abdul Atif Khan;Mohammad Maksood Akhter;Rashmi Maheshwari;Sraban Kumar Mohanty","doi":"10.1109/TKDE.2024.3483572","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3483572","url":null,"abstract":"Approximate spectral clustering (ASC) algorithms work on the representative points of the data for discovering intrinsic groups. The existing ASC methods identify fewer representatives as compared to the number of data points to reduce the cubic computational overhead of the spectral clustering technique. However, identifying such representative points without any domain knowledge to capture the shapes and topology of the clusters remains a challenge. This work proposes an ASC method that suitably computes enough well-scattered representatives to efficiently capture the topology of the data, making the ASC faster without the requirement of tuning any external parameters. The proposed ASC algorithm first applies two-level partitioning using both boundary points and centroids-based partitioning to identify quality representatives in less time. In the next step, we calculate the proximity between the neighboring representatives using \u0000<inline-formula><tex-math>$k$</tex-math></inline-formula>\u0000-rounds of minimum spanning tree (MST) by considering the distribution of edge weights in each round to find \u0000<inline-formula><tex-math>$k$</tex-math></inline-formula>\u0000. The proposed method effectively utilizes the number of representatives in a way that the overall computational time is bounded by \u0000<inline-formula><tex-math>$O(Nlg N)$</tex-math></inline-formula>\u0000. The experimental results suggest that the proposed ASC method outperforms the competing ASC methods in terms of both running time and clustering quality.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"8643-8654"},"PeriodicalIF":8.9,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DynImpt: A Dynamic Data Selection Method for Improving Model Training Efficiency","authors":"Wei Huang;Yunxiao Zhang;Shangmin Guo;Yu-Ming Shang;Xiangling Fu","doi":"10.1109/TKDE.2024.3482466","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3482466","url":null,"abstract":"Selecting key data subsets for model training is an effective way to improve training efficiency. Existing methods generally utilize a well-trained model to evaluate samples and select crucial subsets, ignoring the fact that the sample importance changes dynamically during model training, resulting in the selected subset only being critical in a specific training epoch rather than a changing training phase. To address this issue, we attempt to evaluate the significant changes in sample importance during dynamic training and propose a novel data selection method to improve model training efficiency. Specifically, the temporal changes in sample importance are considered from three perspectives: (i) loss, the difference between the predicted labels and the true labels of samples in the current training epoch; (ii) instability, the dispersion of sample importance in the recent training phase; and (iii) inconsistency, the comparison of the changing trend in the importance of an individual sample relative to the average importance of all samples in the recent training phase. Extensive experiments demonstrate that dynamic data selection can reduce computational costs and improve model training efficiency. Additionally, we find that the difficulty level of the training task influences the data selection strategy.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"239-252"},"PeriodicalIF":8.9,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142797892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Link Prediction via GNN Layers Induced by Negative Sampling","authors":"Yuxin Wang;Xiannian Hu;Quan Gan;Xuanjing Huang;Xipeng Qiu;David Wipf","doi":"10.1109/TKDE.2024.3481015","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3481015","url":null,"abstract":"Graph neural networks (GNNs) for link prediction can loosely be divided into two broad categories. First, \u0000<italic>node-wise</i>\u0000 architectures pre-compute individual embeddings for each node that are later combined by a simple decoder to make predictions. While extremely efficient at inference time, model expressiveness is limited such that isomorphic nodes contributing to candidate edges may not be distinguishable, compromising accuracy. In contrast, \u0000<italic>edge-wise</i>\u0000 methods rely on the formation of edge-specific subgraph embeddings to enrich the representation of pair-wise relationships, disambiguating isomorphic nodes to improve accuracy, but with increased model complexity. To better navigate this trade-off, we propose a novel GNN architecture whereby the \u0000<italic>forward pass</i>\u0000 explicitly depends on \u0000<italic>both</i>\u0000 positive (as is typical) and negative (unique to our approach) edges to inform more flexible, yet still cheap node-wise embeddings. This is achieved by recasting the embeddings themselves as minimizers of a forward-pass-specific energy function that favors separation of positive and negative samples. Notably, this energy is distinct from the actual training loss shared by most existing link prediction models, where contrastive pairs only influence the \u0000<italic>backward pass</i>\u0000. As demonstrated by extensive empirical evaluations, the resulting architecture retains the inference speed of node-wise models, while producing competitive accuracy with edge-wise alternatives.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 1","pages":"253-264"},"PeriodicalIF":8.9,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human-AI Interaction: Human Behavior Routineness Shapes AI Performance","authors":"Tianao Sun;Kai Zhao;Meng Chen","doi":"10.1109/TKDE.2024.3480317","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3480317","url":null,"abstract":"A crucial area of research in Human-AI Interaction focuses on understanding how the integration of AI into social systems influences human behavior, for example, how news-feeding algorithms affect people’s voting decisions. But little attention has been paid to how human behavior shapes AI performance. We fill this research gap by introducing \u0000<italic>routineness</i>\u0000 to measure human behavior for the AI system, which assesses the degree of routine in a person’s activity based on their past activities. We apply the proposed \u0000<italic>routineness</i>\u0000 metric to two extensive human behavior datasets: the human mobility dataset with over 700 million data samples and the social media dataset with over 3.8 million data samples. Our analysis reveals \u0000<italic>routineness</i>\u0000 can effectively detect behavioral changes in human activities. The performance of AI algorithms is profoundly determined by human \u0000<italic>routineness</i>\u0000, which provides valuable guidance for the selection of AI algorithms.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"8476-8487"},"PeriodicalIF":8.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When Quantum Computing Meets Database: A Hybrid Sampling Framework for Approximate Query Processing","authors":"Sai Wu;Meng Shi;Dongxiang Zhang;Junbo Zhao;Gongsheng Yuan;Gang Chen","doi":"10.1109/TKDE.2024.3480278","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3480278","url":null,"abstract":"Quantum computing represents a next-generation technology in data processing, promising to transcend the limitations of traditional computation. In this paper, we undertake an early exploration of the potential integration of quantum computing with database query optimization. We introduce a pioneering hybrid classical-quantum algorithm for sampling-based approximate query processing (AQP). The core concept of the algorithm revolves around identifying rare groups, which often follow a long-tail distribution, and applying distinct sampling methodologies to normal and rare groups. By leveraging the quantum capabilities of the diffusion gate and QRAM, the algorithm defines a novel quantum sampling approach that iteratively amplifies the signals of these infrequent groups. The algorithm operates without the need for preprocessing or prior knowledge of workloads or data. It utilizes the power of quadratic acceleration to achieve well-balanced sampling across various data categories. Experimental results demonstrate that in the context of AQP, the new sampling scheme provides higher accuracy at the same sampling cost. Additionally, the benefits of quantum computing become more pronounced as query selectivity increases.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"9532-9546"},"PeriodicalIF":8.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SUHDSA: Secure, Useful, and High-Performance Data Stream Anonymization","authors":"Yongwan Joo;Soonseok Kim","doi":"10.1109/TKDE.2024.3476684","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3476684","url":null,"abstract":"This study addresses privacy concerns in real-time streaming data, including personal biometric signals and private information from sources such as real-time crime reporting, online sales transactions, and hospital patient-monitoring devices. Anonymization is crucial because it hides sensitive personal data. Achieving anonymity in real-time streaming data involves satisfying the unique demands of real-time scenarios, which is distinct from traditional methods. Specifically, security and minimal information loss must be maintained within a specified timeframe (referred to as the average delay time). The most recent solution in this context is the utility-based approach to data stream anonymization (UBDSA) algorithm developed by Sopaoglu and Abul. This study aims to enhance the performance of UBDSA by introducing a secure, useful, and high-performance data stream anonymization (SUHDSA) algorithm. SUHDSA outperforms UBDSA in terms of runtime and information loss while still ensuring privacy protection and an average delay time. The experimental results, using the same dataset and cluster size as in a previous UBDSA study, demonstrate significant performance improvements with the proposed algorithm. It achieves a minimum runtime of 24.05 s and a maximum runtime of 29.88 s, with information loss rates ranging from 14% to 77%. These results surpass the performance of the previous UBDSA algorithm.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"9336-9347"},"PeriodicalIF":8.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10715680","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Debiased Pairwise Learning for Implicit Collaborative Filtering","authors":"Bin Liu;Qin Luo;Bang Wang","doi":"10.1109/TKDE.2024.3479240","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3479240","url":null,"abstract":"Learning representations from pairwise comparisons has achieved significant success in various fields, including computer vision and information retrieval. In recommendation systems, collaborative filtering algorithms based on pairwise learning are also rooted in this approach. However, a major challenge in collaborative filtering is the lack of labels for negative instances in implicit feedback data, leading to the inclusion of false negatives among randomly selected instances. This issue causes biased optimization objectives and results in biased parameter estimation. In this paper, we propose a novel method to address learning biases arising from implicit feedback data and introduce a modified loss function for pairwise learning, called debiased pairwise loss (DPL). The core idea of DPL is to correct the biased probability estimates caused by false negatives, thereby adjusting the gradients to more closely approximate those of fully supervised data. Implementing DPL requires only a small modification to the existing codebase. Experimental studies on public datasets demonstrate the effectiveness of the proposed method.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"7878-7892"},"PeriodicalIF":8.9,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142636352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accurate and Scalable Graph Convolutional Networks for Recommendation Based on Subgraph Propagation","authors":"Xueqi Li;Guoqing Xiao;Yuedan Chen;Kenli Li;Gao Cong","doi":"10.1109/TKDE.2024.3467333","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3467333","url":null,"abstract":"In recommendation systems, Graph Convolutional Networks (GCNs) often suffer from significant computational and memory cost when propagating features across the entire user-item graph. While various sampling strategies have been introduced to reduce the cost, the challenge of neighbor explosion persists, primarily due to the iterative nature of neighbor aggregation. This work focuses on exploring subgraph propagation for scalable recommendation by addressing two primary challenges: \u0000<italic>efficient and effective subgraph construction</i>\u0000 and \u0000<italic>subgraph sparsity</i>\u0000. To address these challenges, we propose a novel \u0000<underline>GCN</u>\u0000 model for recommendation based on \u0000<underline>Sub</u>\u0000graph propagation, called SubGCN. One key component of SubGCN is BiPPR, a technique that fuses both source- and target-based Personalized PageRank (PPR) approximations, to overcome the challenge of \u0000<italic>efficient and effective subgraph construction</i>\u0000. Furthermore, we propose a source-target contrastive learning scheme to mitigate the impact of \u0000<italic>subgraph sparsity</i>\u0000 for SubGCN. We conduct extensive experiments on two large and two medium-sized datasets to evaluate the scalability, efficiency, and effectiveness of SubGCN. On medium-sized datasets, compared to full-graph GCNs, SubGCN achieves competitive accuracy while using only 23.79% training time on Gowalla and 16.3% on Yelp2018. On large datasets, where full-graph GCNs ran out of the GPU memory, our proposed SubGCN outperforms widely used sampling strategies in terms of training efficiency and recommendation accuracy.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"7556-7568"},"PeriodicalIF":8.9,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142645549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uni-Modal Event-Agnostic Knowledge Distillation for Multimodal Fake News Detection","authors":"Guofan Liu;Jinghao Zhang;Qiang Liu;Junfei Wu;Shu Wu;Liang Wang","doi":"10.1109/TKDE.2024.3477977","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3477977","url":null,"abstract":"With the rapid expansion of multimodal content in online social media, automatic detection of multimodal fake news has received much attention. Multimodal joint training commonly used in existing methods is expected to benefit from thoroughly leveraging cross-modal features, yet these methods still suffer from insufficient learning of uni-modal features. Due to the heterogeneity of multimodal networks, optimizing a single objective will inevitably make the models prone to rely on specific modality while leaving other modalities under-optimized. On the other hand, simply expecting each modality to play a significant role in identifying all the rumors is also not appropriate as the multimodal fake news often involves tampering in only one modality. Therefore, how to find the genuine tampering on the per-sample basis becomes the key point to unlock the full power of each modality in a good collaborative manner. To address these issues, we propose a \u0000<bold><u>U</u></b>\u0000ni-modal \u0000<bold><u>E</u></b>\u0000vent-agnostic \u0000<bold><u>K</u></b>\u0000nowledge \u0000<bold><u>D</u></b>\u0000istillation framework (UEKD), which aims to transfer knowledge contained in the fine-grained prediction from uni-modal teachers to the multimodal student model through modality-specific distillation. Specifically, we find that the uni-modal teachers simply trained on the whole training set are easy to memorize the event-specific noise information to make a correct but biased prediction, failing to reflect the genuine degree of tampering in each modality. To tackle this problem, we propose to train and validate the teacher models on different domains in training dataset through a cross-validation manner, as the predictions from the out-of-domain teachers can be regarded as event-agnostic knowledge without spurious connections with event-specific information. Finally, to balance the convergence speeds across modalities, we dynamically monitor the involvement of each modality during training, through which we could identify the more under-optimized modalities and re-weight the distillation loss accordingly. Our method could be served as a plug-and-play module for existing multimodal fake news detection backbones. Extensive experiments on three public datasets and four state-of-the-art fake news detection backbones show that our proposed method can improve the performance by a large margin.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"9490-9503"},"PeriodicalIF":8.9,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Multivariate Functional Time Series Modeling: A State Space Approach","authors":"Peiyao Liu;Junpeng Lin;Chen Zhang","doi":"10.1109/TKDE.2024.3472906","DOIUrl":"https://doi.org/10.1109/TKDE.2024.3472906","url":null,"abstract":"Functional data have been gaining increasing popularity in the field of time series analysis. However, so far modeling heterogeneous multivariate functional time series remains a research gap. To fill it, this paper proposes a time-varying functional state space model (TV-FSSM). It uses functional decomposition to extract features of the functional observations, where the decomposition coefficients are regarded as latent states that evolve according to a tensor autoregressive model. This two-layer structure can on the one hand efficiently extract continuous functional features, and on the other provide a flexible and generalized description of data heterogeneity among different time points. An expectation maximization (EM) framework is developed for parameter estimation, where regularization and constraints are incorporated for better model interoperability. As the sample size grows, an incremental learning version of the EM algorithm is given to efficiently update the model parameters. Some model properties, including model identifiability conditions, convergence issues, time complexities, and bounds of its one-step-ahead prediction errors, are also presented. Extensive experiments on both real and synthetic datasets are performed to evaluate the predictive accuracy and efficiency of the proposed framework.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"36 12","pages":"8421-8433"},"PeriodicalIF":8.9,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}