{"title":"Denoised Graph Collaborative Filtering via Neighborhood Similarity and Dynamic Thresholding","authors":"Haibo Ye;Lijun Zhang;Yuan Yao;Sheng-Jun Huang","doi":"10.1109/TBDATA.2024.3453765","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3453765","url":null,"abstract":"Graph collaborative filtering (GCF) has achieved great success in recommender systems due to its ability to mine high-order collaborative signals from historical user-item interactions. However, GCF's performance could be severely affected by the intrinsic noise within the user-item interactions. To this end, several denoised GCF frameworks have been proposed, whose core is to estimate and handle the reliability of existing interactions. However, most of them suffer from two limitations: 1) the reliability computation itself is noisy, and 2) the reliability threshold is difficult to determine. To address the two limitations, in this paper, we propose a new Neighborhood-informed Denoising framework NiDen for GCF. Specifically, for an existing user-item interaction, NiDen first estimates its reliability by employing the neighborhood information of the user and the item, and then determines whether the interaction is noisy or not via a dynamic thresholding strategy. After that, NiDen mitigates the negative impact of noise by both structure denoising and sample re-weighting. We instantiate NiDen on two representative GCF models and conduct extensive experiments on four widely-used datasets. 
The results show that NiDen achieves the best performance compared to the existing denoising methods, especially on datasets with heavy noise.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 6","pages":"683-693"},"PeriodicalIF":7.5,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning-Based Distributed Spatio-Temporal $k$ Nearest Neighbors Join","authors":"Ruiyuan Li;Jiajun Li;Minxin Zhou;Rubin Wang;Huajun He;Chao Chen;Jie Bao;Yu Zheng","doi":"10.1109/TBDATA.2024.3442539","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3442539","url":null,"abstract":"The rapid development of positioning technology produces an extremely large volume of spatio-temporal data with various geometry types such as point, line string, polygon, or a mixed combination of them. As one of the most fundamental but time-consuming operations, $k$ nearest neighbors join ($k$NN join) has attracted much attention. However, most existing works for $k$NN join either ignore temporal information or consider only point data. Besides, most of them do not automatically adapt to the different features of spatio-temporal data. This paper proposes to address a novel and useful problem, i.e., ST-$k$NN join, which considers both spatial closeness and temporal concurrency. To support ST-$k$NN join over a large amount of spatio-temporal data with any geometry types efficiently, we propose a novel distributed solution based on Apache Spark. Specifically, our method adopts a two-round join framework. In the first round join, we propose a new spatio-temporal partitioning method that achieves spatio-temporal locality and load balance at the same time. We also propose a lightweight index structure, i.e., Time Range Count Index (TRC-index), to enable efficient ST-$k$NN join. In the second round join, to reduce the data transmission among different machines, we remove duplicates based on spatio-temporal reference points before shuffling local results. 
Furthermore, we design a set of models based on Bayesian optimization to automatically determine the values for the introduced parameters. Extensive experiments are conducted using three real big datasets, showing that our method is much more scalable and runs 9X faster than baselines, and that the proposed models can always predict appropriate parameters for different datasets.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"861-878"},"PeriodicalIF":7.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Query-Aware Method for Approximate Range Search in Hamming Space","authors":"Yang Song;Yu Gu;Min Huang;Ge Yu","doi":"10.1109/TBDATA.2024.3436636","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3436636","url":null,"abstract":"The range search in Hamming space is to explore the binary vectors whose Hamming distances with a query vector are within a given searching threshold. It arises as the core component of many applications, such as image retrieval, pattern recognition, and machine learning. Existing searching methods in Hamming space incur substantial pre-processing overhead, which makes them unsuitable for processing multiple batches of incoming data in a short time. Moreover, significant pre-processing overhead can be a burden when the number of queries is relatively small. In this paper, we propose a query-aware method for the approximate range search in Hamming space with no pre-processing. Specifically, to eliminate the impact of data skewness, we introduce JS-divergence to measure the divergence between data's distribution and query's distribution, and specially design a Query-Aware Dimension Partitioning (QADP) strategy to partition the dimensions into several subspaces according to the scales of given searching thresholds. In the subspaces, the candidates can be efficiently obtained by the basic Pigeonhole Principle and our proposed Anti-Pigeonhole Principle. Furthermore, a sampling strategy is designed to estimate the Hamming distance between the query vector and arbitrary binary vector to obtain the final approximate searching results among the candidates. Experimental results on four real-world datasets illustrate that, in comparison with benchmark methods, our method achieves superior searching accuracy and efficiency. 
The proposed method can increase the searching efficiency up to nearly 16 times with high searching accuracy.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"848-860"},"PeriodicalIF":7.5,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ANF: Crafting Transferable Adversarial Point Clouds via Adversarial Noise Factorization","authors":"Hai Chen;Shu Zhao;Xiao Yang;Huanqian Yan;Yuan He;Hui Xue;Fulan Qian;Hang Su","doi":"10.1109/TBDATA.2024.3436593","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3436593","url":null,"abstract":"Transfer-based adversarial attacks involve generating adversarial point clouds in surrogate models and transferring them to other models to assess 3D model robustness. However, current methods rely too much on surrogate model parameters, limiting transferability. In this work, we use Shapley value to identify positive and negative features, guiding optimization of adversarial noise in feature space. To effectively mislead the 3D classifier, we factorize the adversarial noise into positive and negative noise, with the former keeping the features of the adversarial point cloud close to the negative features, and the latter and the adversarial noise moving it away from the positive features. Finally, a novel adversarial point cloud attack method with Adversarial Noise Factorization is proposed, which is abbreviated as ANF. ANF simultaneously optimizes the adversarial noise and its positive and negative noise in the feature space, only relying on partial network parameters, which significantly reduces the reliance on the surrogate model and improves the transferability of the adversarial point cloud. 
Experiments on well-recognized benchmark datasets show that the transferability of adversarial point clouds generated by ANF could be improved by more than 26.7% on average over state-of-the-art transfer-based adversarial attack methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"835-847"},"PeriodicalIF":7.5,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143627854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALD-GCN: Graph Convolutional Networks With Attribute-Level Defense","authors":"Yihui Li;Yuanfang Guo;Junfu Wang;Shihao Nie;Liang Yang;Di Huang;Yunhong Wang","doi":"10.1109/TBDATA.2024.3433553","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433553","url":null,"abstract":"Graph Neural Networks (GNNs), such as Graph Convolutional Network, have exhibited impressive performance on various real-world datasets. However, many studies have confirmed that deliberately designed adversarial attacks can easily confuse GNNs on the classification of target nodes (targeted attacks) or all the nodes (global attacks). According to our observations, different attributes tend to be treated differently when the graph is attacked. Unfortunately, most of the existing defense methods can only defend at the graph or node level, which ignores the diversity of different attributes within each node. To address this limitation, we propose to leverage a new property, named Attribute-level Smoothness (ALS), which is defined based on the local differences of the graph. We then propose a novel defense method, named GCN with Attribute-level Defense (ALD-GCN), which utilizes the ALS property to provide attribute-level protection to each attribute. Extensive experiments on real-world graphs have demonstrated the superiority of the proposed work and the potentials of our ALS property in the attacks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"788-799"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Local High-Order Graph Learning for Multi-View Clustering","authors":"Zhi Wang;Qiang Lin;Yaxiong Ma;Xiaoke Ma","doi":"10.1109/TBDATA.2024.3433525","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433525","url":null,"abstract":"As the accumulation of multi-view data continues to grow, multi-view clustering has become increasingly important in research fields like data mining. However, current methods have been criticized for their unsatisfactory performance, such as insufficient exploration of intra-view high-order relationships and poor characterization of inter-view diverse features. To overcome these challenges, we propose a novel approach called Local High-order Graph Learning for Multi-View Clustering (LHGL_MVC). Our method aims to explore high-order relationships within a view while also considering diverse information between views. In LHGL_MVC, we learn the initial graphs of each view through self-representation, which are decomposed into consistent and diverse parts to better capture the diversity of different views. Based on consistent parts, we propose a novel local high-order graph learning approach to more effectively explore high-order relationships between samples within each view. At the same time, we leverage high-order relationships between views using the rotated tensor nuclear norm. Finally, we obtain a unified graph for clustering by fusing all consistent affinity graphs and their high-order graphs with adaptive weights. All procedures are integrated into an overall objective function, which mutually promotes during the optimization process. 
The comprehensive experiments conducted on eleven real-world datasets demonstrate that LHGL_MVC significantly outperforms existing algorithms in various measurements, highlighting the superiority of the proposed method.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"761-773"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AFS-FCM With Memory: A Model for Air Quality Multi-Dimensional Prediction With Interpretability","authors":"Zhen Peng;Wanquan Liu;Sung-Kwun Oh","doi":"10.1109/TBDATA.2024.3433467","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433467","url":null,"abstract":"In order to represent the influences of different semantics on targets and improve the prediction with interpretability ability for multi-dimensional time series, we integrate Axiomatic Fuzzy Set (AFS) and Fuzzy Cognitive Map (FCM) with memory for fuzzy knowledge representation and prediction in this paper. The AFS is used to extract semantics of concepts for fuzzy representation using data distribution. The FCM with memory is trained to model the influence relationships between different semantics of concepts and multiple targets based on multi-dimensional time series data. And a multi-dimensional learning algorithm of AFS-FCM with memory based on gradient descent is developed to investigate the influences of different semantics of concepts on multiple targets. Finally, we validate our model by comparing with other FCMs, intrinsic interpretable models and machine learning methods for prediction of air quality multidimensional time series data, and discuss the performance of AFS-FCM with different transformation functions. 
The model can not only predict air quality accurately, but also explicitly reveal the specific quantitative relationship of different semantics of meteorology on air quality.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"810-820"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Systemic Pipeline of Identifying lncRNA-Disease Associations to the Prognosis and Treatment of Hepatocellular Carcinoma","authors":"Wenxiang Zhang;Ye Yuan;Hang Wei;Wenjing Zhang;Bin Liu","doi":"10.1109/TBDATA.2024.3433380","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433380","url":null,"abstract":"Exploring disease mechanisms at the lncRNA level provides valuable guidance for disease prognosis and treatment. Recently, there has been a surge of interest in exploring disease mechanisms via computational methods to overcome the challenge of tremendous manpower and material resources in biological experiments. However, current computational methods suffer from two main limitations: simple data structures that do not consider the close association between multiple types of data, and the lack of a systematic pathogenesis analysis that identified disease-associated lncRNAs are not applied to the downstream disease prognosis and therapeutic analysis from the perspective of data analysis. To this end, we present a systemic pipeline including disease-associated lncRNAs identification and downstream pathogenesis analysis on how the predicted lncRNAs are involved in the disease prognosis and therapy. Due to the importance of identifying disease-associated lncRNAs and the weak interpretability of existing computational identification methods, we propose a novel approach named iLncDA-PT to identify disease-associated lncRNAs considering the interactions between various bio-entities outperforming the other state-of-the-art methods, and then we conduct a systematic subsequent analysis on prognosis and therapy for a specific disease, hepatocellular carcinoma (HCC), as an example. 
Finally, we reveal a significant association between immune checkpoint expression, tumor microenvironment, and drug treatment.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"800-809"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secret Specification Based Personalized Privacy-Preserving Analysis in Big Data","authors":"Jiajun Chen;Chunqiang Hu;Zewei Liu;Tao Xiang;Pengfei Hu;Jiguo Yu","doi":"10.1109/TBDATA.2024.3433433","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433433","url":null,"abstract":"The pursuit of refined data analysis and the preservation of privacy in Big Data pose significant concerns. Among the paramount paradigms for addressing these challenges, differential privacy stands out as a vital area of research. However, traditional differential privacy tends to be excessively restrictive when it comes to individuals’ control over their own data. It often treats all data as inherently sensitive, whereas in reality, not all information related to individuals is sensitive and requires an identical level of protection. In this paper, we define secret specification-based differential privacy (SSDP), where the term “secret specification” implies enabling users to decide what aspects of their information are sensitive and what are not, prior to data generation or processing. By allowing individuals to independently define their secret specifications, the SSDP achieves personalized privacy protection and facilitates effective data analysis. To enable the targeted application of SSDP, we further present task-specific mechanisms designed for database and graph data scenarios. 
Finally, we assess the trade-offs between privacy and utility inherent in the proposed mechanisms through comparative experiments conducted on real datasets, demonstrating the utility enhancements offered by SSDP mechanisms in practical applications.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"774-787"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Deep Learning for Semisupervised Classification of Hyperspectral and LiDAR Data","authors":"Chunyu Pu;Yingxu Liu;Shuai Lin;Xu Shi;Zhengying Li;Hong Huang","doi":"10.1109/TBDATA.2024.3433494","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3433494","url":null,"abstract":"Deep learning (DL) has emerged as a competitive method in single-modality-dominated remote sensing (RS) data classification tasks, but its classification performance inevitably encounters a bottleneck due to the lack of representation diversity in complicated spatial structures with various land cover types. Therefore, the RS community has been actively researching multimodal feature learning techniques for the same scene. However, expert annotation of multisource data consumes a significant amount of time and cost. This article proposes an end-to-end method called semisupervised multimodal dual-path network (SMDN). This method simultaneously explores spatial-spectral features contained in hyperspectral images (HSI) and elevation information provided by light detection and ranging (LiDAR). SMDN exploits an unsupervised novel encoder-decoder structure as the backbone network to construct a multimodal DL architecture by jointly training with a data-specific branch. To obtain discriminative multimodal representations, SMDN is able to guide the collaborative training of two different unsupervised features mapped in the latent subspace with limited labeled training samples. Furthermore, after a simple modification of the fusion strategy in SMDN, it can be applied to unsupervised classification problems. 
Experimental results on benchmark RS datasets validate the effectiveness of the developed SMDN compared with many state-of-the-art methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"821-834"},"PeriodicalIF":7.5,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143629575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}