{"title":"LOFTune: A Low-Overhead and Flexible Approach for Spark SQL Configuration Tuning","authors":"Jiahui Li;Junhao Ye;Yuren Mao;Yunjun Gao;Lu Chen","doi":"10.1109/TKDE.2025.3549232","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3549232","url":null,"abstract":"The query efficiency of Spark SQL is significantly impacted by its configurations. Therefore, configuration tuning has drawn great attention, and various automatic configuration tuning methods have been proposed. However, existing methods suffer from two issues: (1) high tuning overhead: they need to repeatedly execute the workloads several times to obtain the training samples, which is time-consuming; and (2) low throughput: they need to occupy resources like CPU cores and memory for a long time, causing other Spark SQL workloads to wait, thereby reducing the overall system throughput. These issues impede the use of automatic configuration tuning methods in practical systems which have limited tuning budget and many concurrent workloads. To address these issues, this paper proposes a <bold>L</b>ow-<bold>O</b>verhead and <bold>F</b>lexible approach for Spark SQL configuration <bold>Tuning</b>, dubbed <bold>LOFTune</b>. LOFTune reduces the tuning overhead via a sample-efficient optimization framework, which is proposed based on multi-task SQL representation learning and multi-armed bandit. Furthermore, LOFTune solves the low throughput issue with a recommendation-sampling-decoupled tuning framework. Extensive experiments validate the effectiveness of LOFTune. In the sampling-allowed case, LOFTune can save up to 90% of the workload runs comparing with the state-of-the-art methods. Besides, in the zero-sampling case, LOFTune can reduce up to 41.26% of latency.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3528-3542"},"PeriodicalIF":8.9,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zkfhed: A Verifiable and Scalable Blockchain-Enhanced Federated Learning System","authors":"Bingxue Zhang;Guangguang Lu;Yuncheng Wu;Kunpeng Ren;Feida Zhu","doi":"10.1109/TKDE.2025.3550546","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3550546","url":null,"abstract":"Federated learning (FL) is an emerging paradigm that enables multiple clients to collaboratively train a machine learning (ML) model without the need to exchange their raw data. However, it relies on a centralized authority to coordinate participants’ activities. This not only interrupts the entire training task in case of a single point of failure, but also lacks an effective regulatory mechanism to prevent malicious behavior. Although blockchain, with its decentralized architecture and data immutability, has significantly advanced the development of FL, it still struggles to withstand poisoning attacks and faces limitations in computational scalability. We propose Zkfhed, a verifiable and scalable FL system that overcomes the limitations of blockchain-based FL in poison attacks and computational scalability. First, we propose a two-stage audit scheme based on zero-knowledge proofs (ZKPs), which verifies that the training data are extracted from trusted organizations and that computations on the data exactly follow the specified training protocols. Second, we propose a homomorphic encryption delegation learning (HEDL), based on fully homomorphic encryption (FHE). It is capable of outsourcing complex computing to external computing resources without sacrificing the client's data privacy. Final, extensive experiments on real-world datasets demonstrate that Zkfhed can effectively identify malicious clients and is highly efficient and scalable in terms of online time and communication efficiency.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3841-3854"},"PeriodicalIF":8.9,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143902652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiscale Weisfeiler-Leman Directed Graph Neural Networks for Prerequisite-Link Prediction","authors":"Yupei Zhang;Xiran Qu;Shuhui Liu;Yan Pang;Xuequn Shang","doi":"10.1109/TKDE.2025.3552045","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3552045","url":null,"abstract":"Prerequisite-link Prediction (PLP) aims to discover the condition relations of a specific event or a concerned variable, which is a fundamental problem in a large number of fields, such as educational data mining. Current studies on PLP usually developed graph neural networks (GNNs) to learn the representations of pairs of nodes. However, these models fail to distinguish non-isomorphic graphs and integrate multiscale structures, leading to the insufficient expressive capability of GNNs. To this end, we in this paper proposed <italic>k</i>-dimensional Weisferiler-Leman directed GNNs, dubbed <italic>k</i>-WediGNNs, to recognize non-isomorphic graphs via the Weisferiler-Leman algorithm. Furthermore, we integrated the multiscale structures of a directed graph into <italic>k</i>-WediGNNs, dubbed multiscale <italic>k</i>-WediGNNs, from the bidirected views of in-degree and out-degree. With the Siamese network, the proposed models are extended to address the problem of PLP. Besides, the expressive power is then interpreted via theoretical proofs. The experiments were conducted on four publicly available datasets for concept prerequisite relation prediction (CPRP). The results show that the proposed models achieve better performance than the state-of-the-art approaches, where our multiscale <italic>k</i>-WediGNN achieves a new benchmark in the task of CPRP.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3556-3569"},"PeriodicalIF":8.9,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Searching and Querying Maximum Directed $(k,ell )$(k,ℓ)-Plex","authors":"Shuohao Gao;Kaiqiang Yu;Shengxin Liu;Cheng Long;Xun Zhou","doi":"10.1109/TKDE.2025.3569755","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3569755","url":null,"abstract":"Finding cohesive subgraphs from a directed graph is a fundamental approach to analyze directed graph data. We consider a new model called directed <inline-formula><tex-math>$(k,ell )$</tex-math></inline-formula>-plex for a cohesive directed subgraph, which is generalized from the concept of <inline-formula><tex-math>$k$</tex-math></inline-formula>-plex that is only applicable to undirected graphs. Directed <inline-formula><tex-math>$(k,ell )$</tex-math></inline-formula>-plex (or DPlex) has the connection requirements on both inbound and outbound directions of each vertex inside, i.e., each vertex disconnects at most <inline-formula><tex-math>$k$</tex-math></inline-formula> vertices and is meanwhile not pointed to by at most <inline-formula><tex-math>$ell$</tex-math></inline-formula> vertices. In this paper, we study the maximum DPlex search problem which finds a DPlex with the most vertices. We formally prove the NP-hardness of the problem. We then design a heuristic algorithm called <monospace>DPHeuris</monospace>, which finds a DPlex with the size close to the maximum one and runs practically fast in polynomial time. Furthermore, we propose a branch-and-bound algorithm called <monospace>DPBB</monospace> to find the exact maximum DPlex and develop effective graph reduction strategies for boosting the empirical performance. We also consider the problem of querying personalized maximum DPlex, and design a new method called <monospace>DPBBQ</monospace> for the problem. Finally, we conduct extensive experiments on real directed graphs. The experimental results show that (1) our heuristic method can quickly find a near-optimal solution and (2) our branch-and-bound method runs up to six orders of magnitude faster than other baselines.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4743-4757"},"PeriodicalIF":8.9,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph-Based Clustering: High-Order Bipartite Graph for Proximity Learning","authors":"Zihua Zhao;Danyang Wu;Rong Wang;Zheng Wang;Feiping Nie;Xuelong Li","doi":"10.1109/TKDE.2025.3569681","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3569681","url":null,"abstract":"Structured proximity matrix learning, one of the mainstream directions in clustering research, refers to learning a proximity matrix with an explicit clustering structure from the original first-order proximity matrix. Due to the complexity of the data structure, the original first-order proximity matrix always lacks some must-links compared to the groundtruth proximity matrix. It is worth noting that high-order proximity matrices can provide missed must-link information. However, the computation of high-order proximity matrices and clustering based on them are expensive. To solve the above problem, inspired by the anchor bipartite graph, we present a novel high-order bipartite graph proximity matrix and a fast method to compute it. This proposed high-order bipartite graph proximity matrix contains high-order proximity information and can significantly reduce the computational complexity of the whole clustering process. Furthermore, we introduce an efficient and simple high-order bipartite graph fusion framework that can adaptively assign weights to each order of the high-order bipartite graph matrices. Finally, under the Laplace rank constraint, a consensus structured bipartite graph proximity matrix is obtained. At the same time, an efficient solution algorithm is proposed for this model. The model's efficacy is underscored through rigorous experiments, highlighting its superior clustering performance and time efficiency.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4649-4663"},"PeriodicalIF":8.9,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Genie: A Lightweight Serverless Infrastructure for In-Memory Key-Value Caching With Fine-Grained and Prompt Elasticity","authors":"Huijuan Xiao;Shixi Yang;Kai Zhang;Yinan Jing;Zhenying He;X. Sean Wang","doi":"10.1109/TKDE.2025.3556427","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3556427","url":null,"abstract":"An increasing number of web applications require cloud in-memory key-value stores to minimize latency and achieve higher throughput. They generally have diverse characteristics and constantly changing traffic volumes, which require different computational and memory resources. A serverless in-memory key-value store characterized by elastic resource allocation and pay-as-you-go billing could satisfy the requirements of diverse and dynamic workloads. However, we find current serverless IMKVs fail to achieve fine-grained and prompt resource elasticity due to the limitations of their infrastructures. This paper proposes Genie, a lightweight serverless infrastructure for in-memory key-value caching with fine-grained and immediate elasticity. In Genie, a novel approach is adopted to enable dynamic and independent resource allocation to multiple tenants. It processes all arrived requests and estimates the vCPU consumption with a lightweight machine-learning approach for fine-grained billing. Moreover, Genie estimates the working set and dynamically resizes the allocated memory for hit ratio requirements. Evaluation results show that CPU estimation could be achieved every 100 microseconds without impacting system performance, and memory capacity could be adjusted by megabytes within seconds. The holistic design incurs 1% -2% performance degradation compared to our baseline. Moreover, Genie achieves an average of 58.3% CPU and 49.9% memory savings compared to AsparaDB for Memcache.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4089-4103"},"PeriodicalIF":8.9,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144219593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UVTM: Universal Vehicle Trajectory Modeling With ST Feature Domain Generation","authors":"Yan Lin;Jilin Hu;Shengnan Guo;Bin Yang;Christian S. Jensen;Youfang Lin;Huaiyu Wan","doi":"10.1109/TKDE.2025.3570428","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3570428","url":null,"abstract":"Vehicle movement is frequently captured in the form of GPS trajectories, i.e., sequences of timestamped GPS locations. Such data is widely used for various tasks such as travel-time estimation, trajectory recovery, and trajectory prediction. A universal vehicle trajectory model could be applied to different tasks, removing the need to maintain multiple specialized models, thereby reducing computational and storage costs. However, creating such a model is challenging when the integrity of trajectory features is compromised, i.e., in scenarios where only partial features are available or the trajectories are sparse. To address these challenges, we propose the Universal Vehicle Trajectory Model (UVTM), which can effectively adapt to different tasks without excessive retraining. UVTM incorporates two specialized designs. First, it divides trajectory features into three distinct domains. Each domain can be masked and generated independently to accommodate tasks with only partially available features. Second, UVTM is pre-trained by reconstructing dense, feature-complete trajectories from sparse, feature-incomplete counterparts, enabling strong performance even when the integrity of trajectory features is compromised. Experiments involving four representative trajectory-related tasks on three real-world vehicle trajectory datasets provide insight into the performance of UVTM and offer evidence that it is capable of meeting its objectives.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4894-4907"},"PeriodicalIF":8.9,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144573001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ST-LLM+: Graph Enhanced Spatio-Temporal Large Language Models for Traffic Prediction","authors":"Chenxi Liu;Kethmi Hirushini Hettige;Qianxiong Xu;Cheng Long;Shili Xiang;Gao Cong;Ziyue Li;Rui Zhao","doi":"10.1109/TKDE.2025.3570705","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3570705","url":null,"abstract":"Traffic prediction is a crucial component of data management systems, leveraging historical data to learn spatio-temporal dynamics for forecasting future traffic and enabling efficient decision-making and resource allocation. Despite efforts to develop increasingly complex architectures, existing traffic prediction models often struggle to generalize across diverse datasets and contexts, limiting their adaptability in real-world applications. In contrast to existing traffic prediction models, large language models (LLMs) progress mainly through parameter expansion and extensive pre-training while maintaining their fundamental structures. In this paper, we propose ST-LLM+, the graph enhanced spatio-temporal large language models for traffic prediction. Through incorporating a proximity-based adjacency matrix derived from the traffic network into the calibrated LLMs, ST-LLM+ captures complex spatio-temporal dependencies within the traffic network. The Partially Frozen Graph Attention (PFGA) module is designed to retain global dependencies learned during LLMs pre-training while modeling localized dependencies specific to the traffic domain. To reduce computational overhead, ST-LLM+ adopts the LoRA-augmented training strategy, allowing attention layers to be fine-tuned with fewer learnable parameters. Comprehensive experiments on real-world traffic datasets demonstrate that ST-LLM+ outperforms state-of-the-art models. In particular, ST-LLM+ also exhibits robust performance in both few-shot and zero-shot prediction scenarios. Additionally, our case study demonstrates that ST-LLM+ captures global and localized dependencies between stations, verifying its effectiveness for traffic prediction tasks.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4846-4859"},"PeriodicalIF":8.9,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144573006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incomplete Multi-View Clustering via Multi-Level Contrastive Learning","authors":"Jun Yin;Pei Wang;Shiliang Sun;Zhonglong Zheng","doi":"10.1109/TKDE.2025.3568795","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3568795","url":null,"abstract":"Although significant progress has been made in multi-view learning over the past few decades, it remains challenging, especially in the context of incomplete multi-view clustering, where modeling complex correlations among different views and handling missing data are key difficulties. In this paper, we propose a novel incomplete multi-view clustering network to address the aforementioned issue, named Incomplete Multi-view Clustering via Multi-level Contrastive Learning (IMC-MCL). Specifically, the proposed model aims to minimize the conditional entropy between views to recover missing data by dual prediction strategy. Moreover, the approach learns multi-level features, including latent, high-level and semantic features, with the goal of satisfying both reconstruction and consistency objectives in distinct feature spaces. Specifically, latent features are utilized to accomplish the reconstruction objective, while high-level features and semantic labels are employed to achieve the two consistency goals through contrastive learning. This framework enables the exploration of shared semantics within high-level features and achieves clustering assignment using semantic features. Extensive experiments have shown that the proposed approach outperforms other state-of-the-art incomplete multi-view clustering methods on seven challenging datasets.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 8","pages":"4716-4727"},"PeriodicalIF":8.9,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144572988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Final: Combining First-Order Logic With Natural Logic for Question Answering","authors":"Jihao Shi;Xiao Ding;Siu Cheung Hui;Yuxiong Yan;Hengwei Zhao;Ting Liu;Bing Qin","doi":"10.1109/TKDE.2025.3551231","DOIUrl":"https://doi.org/10.1109/TKDE.2025.3551231","url":null,"abstract":"Many question-answering problems can be approached as textual entailment tasks, where the hypotheses are formed by the question and candidate answers, and the premises are derived from an external knowledge base. However, current neural methods often lack transparency in their decision-making processes. Moreover, first-order logic methods, while systematic, struggle to integrate unstructured external knowledge. To address these limitations, we propose a neuro-symbolic reasoning framework called <italic><small>Final</small></i>, which combines <underline><b>FI</b></u>rst-order logic with <underline><b>NA</b></u>tural <underline><b>L</b></u>ogic for question answering. Our framework utilizes <italic>first-order logic</i> to systematically decompose hypotheses and <italic>natural logic</i> to construct reasoning paths from premises to hypotheses, employing bidirectional reasoning to establish links along the reasoning path. This approach not only enhances interpretability but also effectively integrates unstructured knowledge. Our experiments on three benchmark datasets, namely QASC, WorldTree, and WikiHop, demonstrate that <sc>Final</small> outperforms existing methods in commonsense reasoning and reading comprehension tasks, achieving state-of-the-art results. Additionally, our framework also provides transparent reasoning paths that elucidate the rationale behind the correct decisions.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 6","pages":"3103-3117"},"PeriodicalIF":8.9,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143896421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}