Weiqiang Jin;Mengying Jiang;Tao Tao;Hao Zhou;Xiaotian Wang;Biao Zhao;Guang Yang
{"title":"Can Rumor Detection Enhance Fact Verification? Unraveling Cross-Task Synergies Between Rumor Detection and Fact Verification","authors":"Weiqiang Jin;Mengying Jiang;Tao Tao;Hao Zhou;Xiaotian Wang;Biao Zhao;Guang Yang","doi":"10.1109/TBDATA.2024.3442555","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442555","url":null,"abstract":"Recently, rumor detection (fake news detection) has seen a surge in research interest, and fact verification (fake news checking) has simultaneously become a significant research topic. Despite the inherent distinction between fact verification and rumor detection – the former being a three-category task and the latter a binary one – there has yet to be an in-depth exploration of the synergies between these two tasks. Furthermore, given the severe scarcity of fact verification datasets and the time and cost of constructing them, few-shot/zero-shot fact verification methods are particularly favored. To tackle these challenges, we conduct a series of studies around “How can rumor detection enhance few-shot fact verification, and to what extent?”. Specifically, we systematically investigate the knowledge transferability between the two tasks, proposing a framework, Det2Ver, that is applicable to both rumor detection and fact verification. Through the construction of adaptive prompt templates and prompt-tuned LLMs like T5, Det2Ver synchronizes the two tasks at the structural level and uses external knowledge from rumor detection to reinforce the fact verification task. We demonstrate the significance and effectiveness of Det2Ver. 
In few-shot/zero-shot experiments on three widely used datasets, Det2Ver's cross-task knowledge augmentation brings a significant improvement in macro-F1 for fact verification over other LLM prompt-tuning baselines.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1171-1187"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Multi-View Heterogeneous Hypergraph Convolutional Network for Heterogeneous Information Network Representation Learning","authors":"Rui Bing;Guan Yuan;Yanmei Zhang;Senzhang Wang;Bohan Li;Yong Zhou","doi":"10.1109/TBDATA.2024.3442549","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442549","url":null,"abstract":"Heterogeneous hypergraph neural networks are powerful tools to capture complex correlations among various nodes in Heterogeneous Information Networks (HINs). Despite their satisfactory performance, they are still plagued by the following problems: 1) They cannot capture the correlations in the structural and semantic views at once, leading to topological information loss. 2) Because the number of nodes is greater than the number of node types, the node-level self-attention they use introduces a massive number of parameters and leads to high time consumption. 3) Interactions in meta-paths may be redundant, resulting in correlation bias. To address these three issues, we propose an efficient Multi-View Heterogeneous Hypergraph Convolutional Network (MVH²GCN). It first constructs relational and semantic hypergraphs based on different types of edges and meta-paths, respectively, to represent the complex correlations in the structural view and the semantic view. Meanwhile, clean semantic hypergraphs are generated by a structure learning network to avoid redundancy. Then, an efficient hypergraph convolutional network is designed to learn node embeddings. By doing so, correlations in the two views are captured. Finally, the learned node embeddings from the two views are aggregated via a gated embedding fusion module for downstream tasks. 
Experimental results demonstrate that MVH²GCN is effective and efficient.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1144-1157"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Di Zang;Zhe Cui;Zengqiang Wang;Juntao Lei;Yongjie Ding;Chenguang Wei;Junqi Zhang
{"title":"Geometric Algebra Multi-Order Graph Neural Network for Traffic Prediction","authors":"Di Zang;Zhe Cui;Zengqiang Wang;Juntao Lei;Yongjie Ding;Chenguang Wei;Junqi Zhang","doi":"10.1109/TBDATA.2024.3442533","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442533","url":null,"abstract":"Accurate traffic prediction is crucial for urban traffic management. Spatial-temporal graph neural networks, which combine graph neural networks with time series processing, have been extensively employed in traffic prediction. However, traditional graph neural networks only capture pairwise spatial relationships between road network nodes, neglecting high-order interactions among multiple nodes. Meanwhile, most work for extracting temporal dependencies suffers from implicit modeling and overlooks the internal and external dependencies of time series. To address these challenges, we propose a Geometric Algebraic Multi-order Graph Neural Network (GA-MGNN). Specifically, in the temporal dimension, we design a convolution kernel based on the rotation matrix of geometric algebra, which not only learns internal dependencies between different time steps in time series but also external dependencies between time series and convolution kernels. In the spatial dimension, we construct a tokenized hypergraph and integrate dynamic graph convolution with attention hypergraph convolution to comprehensively capture multi-order spatial dependencies. Additionally, we design a segmented loss function based on traffic periodic information to further improve prediction accuracy. 
Extensive experiments on seven real-world datasets demonstrate that GA-MGNN outperforms state-of-the-art baselines.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1206-1220"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144090744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Free Knowledge Filtering and Distillation in Federated Learning","authors":"Zihao Lu;Junli Wang;Changjun Jiang","doi":"10.1109/TBDATA.2024.3442551","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442551","url":null,"abstract":"In federated learning (FL), multiple parties collaborate to train a global model by aggregating their local models while keeping private training sets isolated. One problem hindering effective model aggregation is data heterogeneity. Federated ensemble distillation tackles this problem by using fused local-model knowledge to train the global model rather than directly averaging model parameters. However, most existing methods fuse all knowledge indiscriminately, which makes the global model inherit some data-heterogeneity-caused flaws from local models. While knowledge filtering is a potential coping method, its implementation in FL is challenging due to the lack of public data for knowledge validation. To address this issue, we propose a novel data-free approach (FedKFD) that synthesizes credible labeled data to support knowledge filtering and distillation. Specifically, we construct a prediction capability description to characterize the samples where a local model makes correct predictions. FedKFD explores the intersection of local-model-input space and prediction capability descriptions with a conditional generator to synthesize consensus-labeled proxy data. With these labeled data, we filter for relevant local-model knowledge and further train a robust global model through distillation. 
The theoretical analysis and extensive experiments demonstrate that our approach achieves improved generalization, superior performance, and compatibility with other FL efforts.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1128-1143"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Moving Conditional GAN Close to Data: Synthetic Tabular Data Generation and Its Experimental Evaluation","authors":"Abdul Majeed;Seong Oun Hwang","doi":"10.1109/TBDATA.2024.3442534","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442534","url":null,"abstract":"Recently, data has ousted oil as the most economical resource in the world, but most companies are reluctant to share customer/user data in pure form and on a large scale due to privacy concerns. Many innovative technologies (e.g., federated learning, split learning) are employed to meet the growing demand for privacy preservation. Despite these technologies, acquiring personal data in order to optimize utility, and then sharing it on a large scale, is still very challenging. Thanks to the rapid development of artificial intelligence (AI), a relatively new and promising solution to resolve these challenges is to generate synthetic data (SD) by mirroring the original dataset’s properties. SD is a promising solution to address growing privacy demands as well as the utility/analytics requirements of many industry stakeholders. In this paper, we propose and implement an SD generation method from a real dataset containing both numerical and categorical attributes by using an improved conditional generative adversarial network (CGAN), and we quantify the feasibility of SD on technical and theoretical grounds. We provide a detailed analysis of SD in original and anonymized forms with the help of multiple use cases, whereas prior research simply assumed that privacy issues in SD are small because AI models do not overfit or SD has a poor connection with real data. We provide insights into the characteristics of SD (distributions, value frequencies, correlations, etc.) produced by the CGAN in relation to the real data. 
To the best of our knowledge, this is the pioneering work that provides an experiment-based analysis of the quality, privacy, and utility of SD in relation to a real benchmark dataset.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1188-1205"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Restoration of Recaptured Screen Images With a Divide and Conquer Strategy","authors":"Yujing Sun;Hao Xiong;Siu Ming Yiu","doi":"10.1109/TBDATA.2024.3442538","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442538","url":null,"abstract":"Moiré patterns in recaptured screen images are image defects that can affect image quality to an extreme extent. Unlike other image defects, moiré artefacts can vary greatly in scale, colour, and shape. Such moiré patterns mix with image content and disturb image features of different scales in different ways, making moiré pattern removal a challenging task. In this paper, we present a novel divide-and-conquer strategy to solve the problem. In the divide stage, we decompose images into different layers, as well as into structure components and detail components. Then, in the conquer stage, guided by the layers retrieved from the divide stage, we restore coarse and fine image components independently, which greatly improves the demoiréing performance. Our strategy outperforms state-of-the-art methods in both quantitative and qualitative evaluations.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1103-1115"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Adaptation for Label Distribution Learning","authors":"Haitao Wu;Weiwei Li;Xiuyi Jia","doi":"10.1109/TBDATA.2024.3442562","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442562","url":null,"abstract":"Label distribution learning (LDL) suffers from the dilemma of insufficient target data in real-world applications, while domain adaptation (DA) seems to be able to provide a solution. However, most existing methods of DA, assuming that the instances can correspond to the explicit class information, are devoted only to classification but not to LDL. We argue that indiscriminately applying such DA methods might cause performance degradation in LDL tasks. In this paper, we propose LDL-DA, a novel algorithm dedicated to supervised domain adaptation for label distribution learning, which jointly learns a shared encoding representation from two aspects: 1) contrastive alignment of scarce supervised target data, and 2) minimizing the distance between prototypes of the same label combination. Experiments show that LDL-DA outperforms existing DA methods adapted to LDL, and provides early positive results in DA for LDL. To the best of our knowledge, this paper is the first research on DA for LDL.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1221-1234"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Federated Learning Framework for Diabetic Retinopathy Grading and Lesion Segmentation","authors":"Jingxin Mao;Xiaoyu Ma;Yanlong Bi;Rongqing Zhang","doi":"10.1109/TBDATA.2024.3442548","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442548","url":null,"abstract":"Diabetic retinopathy (DR) is a debilitating ocular complication demanding timely intervention and treatment. The rapid evolution of deep learning (DL) has notably enhanced the efficiency of conventional manual diagnosis. However, the scarcity of existing DR datasets hinders the progress of data-driven DL models, especially for pixel-level lesion annotation datasets, which severely impedes the advancement of DR lesion segmentation tasks required for precise interpretations of DR grading. Furthermore, the escalating concerns surrounding medical data security and privacy induce data collection challenges for traditional centralized learning, exacerbating the issue of data silos. Federated learning (FL) emerges as a privacy-preserving distributed learning paradigm. Nevertheless, the existing literature lacks a comprehensive FL framework for DR diagnosis and fails to exploit multiple diverse DR datasets simultaneously. To address the challenges of data scarcity and privacy, we construct a high-quality pixel-level DR lesion annotation dataset (TJDR) and propose a novel FL-based DR diagnosis framework including both DR grading and multi-lesion segmentation. Moreover, to tackle the scarcity of pixel-level DR lesion datasets, we propose α-Fed and adaptive-α-Fed, two efficient cross-dataset FL algorithms. 
Extensive experiments demonstrate the effectiveness of our proposed framework and the two cross-dataset FL algorithms.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1158-1170"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Where is the Next Step? Predicting the Scientific Impact of Research Career","authors":"Hefu Zhang;Yong Ge;Yan Zhuang;Enhong Chen","doi":"10.1109/TBDATA.2024.3442550","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442550","url":null,"abstract":"Predicting the scientific impact of research scholars is increasingly crucial for career planning, particularly for young scholars considering career transitions. However, predicting a scholar's future development, especially after they move to a different academic group, presents significant challenges. To tackle this issue, we propose a Future Publication Impact Prediction Network (FPIPN) based on graph neural networks. FPIPN leverages rich information from a heterogeneous academic graph for impact prediction. We employ a hierarchical attention mechanism to learn the significance of graph information and utilize a knowledge distillation strategy to assess future impact based on historical records. Extensive experiments on a real-world academic dataset showcase the effectiveness of our approach compared to state-of-the-art methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"1116-1127"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143949327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Query-Aware Method for Approximate Range Search in Hamming Space","authors":"Yang Song;Yu Gu;Min Huang;Ge Yu","doi":"10.1109/TBDATA.2024.3436636","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3436636","url":null,"abstract":"Range search in Hamming space finds the binary vectors whose Hamming distances to a query vector are within a given search threshold. It arises as the core component of many applications, such as image retrieval, pattern recognition, and machine learning. Existing search methods in Hamming space incur substantial pre-processing overhead, which makes them unsuitable for processing multiple batches of incoming data in a short time. Moreover, significant pre-processing overhead can be a burden when the number of queries is relatively small. In this paper, we propose a query-aware method for approximate range search in Hamming space with no pre-processing. Specifically, to eliminate the impact of data skewness, we introduce JS-divergence to measure the divergence between the data distribution and the query distribution, and design a Query-Aware Dimension Partitioning (QADP) strategy to partition the dimensions into several subspaces according to the scales of the given search thresholds. In the subspaces, the candidates can be efficiently obtained by the basic Pigeonhole Principle and our proposed Anti-Pigeonhole Principle. Furthermore, a sampling strategy is designed to estimate the Hamming distance between the query vector and an arbitrary binary vector to obtain the final approximate search results among the candidates. Experimental results on four real-world datasets illustrate that, in comparison with benchmark methods, our method achieves superior search accuracy and efficiency. 
The proposed method can improve search efficiency by up to nearly 16 times with high search accuracy.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"848-860"},"PeriodicalIF":7.5,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}