{"title":"Parallelly Running and Privacy-Preserving Agglomerative Hierarchical Clustering in Outsourced Cloud Computing Environments","authors":"Jeongsu Park;Dong Hoon Lee","doi":"10.1109/TBDATA.2024.3403375","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3403375","url":null,"abstract":"As a Big Data analysis technique, hierarchical clustering is helpful in summarizing data since it returns the clusters of the data and their clustering history. Cloud computing is the most suitable option to efficiently perform hierarchical clustering over numerous data. However, since compromised cloud service providers can cause serious privacy problems by revealing data, it is necessary to solve the problems prior to using the external cloud computing service. Privacy-preserving hierarchical clustering protocol in an outsourced computing environment has never been proposed in existing works. Existing protocols have several problems that limit the number of participating data owners or disclose the information of data. In this article, we propose a parallelly running and privacy-preserving agglomerative hierarchical clustering (ppAHC) over the union of datasets of multiple data owners in an outsourced computing environment, which is the first protocol to the best of our knowledge. The proposed ppAHC does not disclose any information about input and output, including the data access patterns. The proposed ppAHC is highly efficient and suitable for Big Data analysis to handle numerous data since its cost for one round is independent of the amount of data. 
It allows data owners without sufficient computing capability to participate in a collaborative hierarchical clustering.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"174-189"},"PeriodicalIF":7.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training Large-Scale Graph Neural Networks via Graph Partial Pooling","authors":"Qi Zhang;Yanfeng Sun;Shaofan Wang;Junbin Gao;Yongli Hu;Baocai Yin","doi":"10.1109/TBDATA.2024.3403380","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3403380","url":null,"abstract":"Graph Neural Networks (GNNs) are powerful tools for graph representation learning, but they face challenges when applied to large-scale graphs due to substantial computational costs and memory requirements. To address scalability limitations, various methods have been proposed, including sampling-based and decoupling-based methods. However, these methods have their limitations: sampling-based methods inevitably discard some link information during the sampling process, while decoupling-based methods require alterations to the model's structure, reducing their adaptability to various GNNs. This paper proposes a novel graph pooling method, Graph Partial Pooling (GPPool), for scaling GNNs to large-scale graphs. GPPool is a versatile and straightforward technique that enhances training efficiency while simultaneously reducing memory requirements. GPPool constructs small-scale pooled graphs by pooling partial nodes into supernodes. Each pooled graph consists of supernodes and unpooled nodes, preserving valuable local and global information. Training GNNs on these graphs reduces memory demands and enhances their performance. Additionally, this paper provides a theoretical analysis of training GNNs using GPPool-constructed graphs from a graph diffusion perspective. It shows that a GNN can be transformed from a large-scale graph into pooled graphs with minimal approximation error. 
A series of experiments on datasets of varying scales demonstrates the effectiveness of GPPool.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"221-233"},"PeriodicalIF":7.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Subspace Learning-Based Binary Differential Evolution Algorithm for Unsupervised Feature Selection","authors":"Tao Li;Yuhua Qian;Feijiang Li;Xinyan Liang;Zhi-Hui Zhan","doi":"10.1109/TBDATA.2024.3378090","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3378090","url":null,"abstract":"It is a challenging task to select the informative features that can maintain the manifold structure in the original feature space. Many unsupervised feature selection methods still suffer from poor clustering performance on the selected feature subset. To tackle this problem, a feature subspace learning-based binary differential evolution algorithm is proposed for unsupervised feature selection. First, a new unsupervised feature selection framework based on evolutionary computation is designed, in which the feature subspace learning and the population search mechanism are combined into a unified unsupervised feature selection. Second, a local manifold structure learning strategy and a sample pseudo-label learning strategy are presented to calculate the importance of the selected feature subspace. Third, the binary differential evolution algorithm is developed to optimize the selected feature subspace, in which the binary information migration mutation operator and the adaptive crossover operator are designed to promote the search for the globally optimal feature subspace. 
Experimental results on various types of real-world datasets demonstrate that the proposed algorithm can obtain a more informative feature subset and competitive clustering performance compared with eight state-of-the-art unsupervised feature selection methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"99-114"},"PeriodicalIF":7.5,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning From Crowds Using Graph Neural Networks With Attention Mechanism","authors":"Jing Zhang;Ming Wu;Zeyi Sun;Cangqi Zhou","doi":"10.1109/TBDATA.2024.3378100","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3378100","url":null,"abstract":"Crowdsourcing has been playing an essential role in machine learning since it can obtain a large number of labels in an economical and fast manner for training increasingly complex learning models. However, the application of crowdsourcing learning still faces several challenges such as the low quality of crowd labels and the urgent requirement for learning models adapting to the label noises. There have been many studies focusing on truth inference algorithms to improve the quality of labels obtained by crowdsourcing. Comparably, end-to-end predictive model learning in crowdsourcing scenarios, especially using cutting-edge deep learning techniques, is still in its infant stage. In this paper, we propose a novel graph convolutional network-based framework, namely CGNNAT, which models the correlation of instances by combining the GCN model with an attention mechanism to learn more representative node embeddings for a better understanding of the bias tendency of crowd workers. Furthermore, a specific projection processing layer is employed in CGNNAT to model the reliability of each crowd worker, which makes the model an end-to-end neural network directly trained by noisy crowd labels. 
Experimental results on several real-world and synthetic datasets show that the proposed CGNNAT outperforms state-of-the-art and classical methods in terms of label prediction.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"86-98"},"PeriodicalIF":7.5,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed Generative Adversarial Network for Data Augmentation Under Vertical Federated Learning","authors":"Yunpeng Xiao;Xufeng Li;Tun Li;Rong Wang;Yucai Pang;Guoyin Wang","doi":"10.1109/TBDATA.2024.3375150","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3375150","url":null,"abstract":"Vertical federated learning can aggregate participant data features. To address the issue of insufficient overlapping data in vertical federated learning, this study presents a generative adversarial network model that allows distributed data augmentation. First, this study proposes a distributed generative adversarial network FeCGAN for multiple participants with insufficient overlapping data, considering the fact that the generative adversarial network can generate simulation samples. This network is suitable for multiple data sources and can augment participants’ local data. Second, to address the problem of learning divergence caused by different local distributions of multiple data sources, this study proposes the aggregation algorithm FedKL. It aggregates the feedback of the local discriminator to interact with the generator and learns the local data distribution more accurately. Finally, given the problem of data waste caused by the unavailability of nonoverlapping data, this study proposes a data augmentation method called VFeDA. It uses FeCGAN to generate pseudo features and expands more overlapping data, thereby improving the data use. 
Experiments showed that the proposed model is suitable for multiple data sources and can generate high-quality data.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"74-85"},"PeriodicalIF":7.5,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PR3: Reversible and Usability-Enhanced Visual Privacy Protection via Thumbnail Preservation and Data Hiding","authors":"Ruoyu Zhao;Yushu Zhang;Wenying Wen;Xinpeng Zhang;Xiaochun Cao;Yong Xiang","doi":"10.1109/TBDATA.2024.3375155","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3375155","url":null,"abstract":"Image hosting platforms are becoming increasingly popular due to their user-friendly features, but they are prone to raising privacy concerns. Protecting privacy alone is, in fact, easy to achieve, but usability is frequently sacrificed. Visual privacy protection schemes aim to strike a balance between privacy and usability, but they are often irreversible. Recently, some reversible visual privacy protection schemes have been proposed that preserve thumbnails (known as TPE). However, they either have excessive states in the Markov chain modeled by the scheme or cannot be reversed losslessly. Meanwhile, images encrypted by existing TPE schemes cannot embed additional information, and thus their usability is limited to visual observation. In view of this, we propose a reversible and usability-enhanced visual privacy protection scheme (called PR3) based on thumbnail preservation and data hiding. In this scheme, we utilize the sum-preserving data embedding algorithm to substitute the lowest seven bits of the image without changing the sum. Any data overflow resulting from the above process is stored in the vacated space of the most significant bits. The remaining space serves two purposes: embedding additional information and adjusting the image to approximate the thumbnail. Compared with existing TPE works, PR3 has fewer states in the Markov chain and supports lossless recovery of images. 
In addition, additional information can be embedded in the encrypted image to enhance usability.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"59-73"},"PeriodicalIF":7.5,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast and Robust Attention-Free Heterogeneous Graph Convolutional Network","authors":"Yeyu Yan;Zhongying Zhao;Zhan Yang;Yanwei Yu;Chao Li","doi":"10.1109/TBDATA.2024.3375152","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3375152","url":null,"abstract":"Due to the widespread applications of heterogeneous graphs in the real world, heterogeneous graph neural networks (HGNNs) have developed rapidly and achieved great success in recent years. To effectively capture the complex interactions in heterogeneous graphs, various attention mechanisms are widely used in designing HGNNs. However, the employment of these attention mechanisms brings two key problems: high computational complexity and poor robustness. To address these problems, we propose a Fast and Robust attention-free Heterogeneous Graph Convolutional Network (FastRo-HGCN) without any attention mechanisms. Specifically, we first construct virtual links based on the topology similarity and feature similarity of the nodes to strengthen the connections between the target nodes. Then, we design type normalization to aggregate and transfer the intra-type and inter-type node information. The above methods are used to reduce the interference of noisy information. Finally, we further enhance the robustness and relieve the negative effects of oversmoothing with the self-loops of nodes. 
Extensive experimental results on three real-world datasets fully demonstrate that the proposed FastRo-HGCN significantly outperforms the state-of-the-art models.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"669-681"},"PeriodicalIF":7.5,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FuzzyPPI: Large-Scale Interaction of Human Proteome at Fuzzy Semantic Space","authors":"Anup Kumar Halder;Soumyendu Sekhar Bandyopadhyay;Witold Jedrzejewski;Subhadip Basu;Jacek Sroka","doi":"10.1109/TBDATA.2024.3375149","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3375149","url":null,"abstract":"The large-scale protein-protein interaction (PPI) network of an organism provides key insights into its cellular and molecular functionalities, signaling pathways and underlying disease mechanisms. For any organism, the total unexplored protein interactions significantly outnumber all known positive and negative interactions. For Human, all known PPI datasets contain only ~5.61 million positive and ~0.76 million negative interactions, which is ~3.1% of potential interactions. We have implemented a distributed algorithm in Apache Spark that evaluates a Human PPI network of ~180 million potential interactions resulting from 18 994 reviewed proteins for which Gene Ontology (GO) annotations are available. The computed scores have been validated against state-of-the-art methods on benchmark datasets. FuzzyPPI performed significantly better with an average F1 score of 0.62 compared to GOntoSim (0.39), GOGO (0.38), and Wang (0.38) when tested with the Gold Standard PPI Dataset. The resulting scores are published with a web server for non-commercial use at http://fuzzyppi.mimuw.edu.pl/. Moreover, conventional PPI prediction methods produce binary results, but in fact this is just a simplification, as PPIs have strengths or probabilities, and recent studies show that protein binding affinities may prove to be effective in detecting protein complexes, disease association analysis, signaling network reconstruction, etc. 
Keeping these in mind, our algorithm is based on a fuzzy semantic scoring function and produces probabilities of interaction.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"47-58"},"PeriodicalIF":7.5,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FAER: Fairness-Aware Event-Participant Recommendation in Event-Based Social Networks","authors":"Yuan Liang","doi":"10.1109/TBDATA.2024.3372409","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3372409","url":null,"abstract":"The event-based social network (EBSN) is a new type of social network that combines online and offline networks. In recent years, an important task in EBSN recommendation systems has been to design better and more reasonable recommendation algorithms to improve the accuracy of recommendation and enhance user satisfaction. However, the current research seldom considers how to coordinate fairness among individual users and reduce the impact of individual unreasonable feedback in group event recommendation. In addition, when considering the fairness to individuals, the accuracy of recommendation is not greatly improved by fully incorporating the key context information. To solve these problems, we propose a prefiltering algorithm to filter the candidate event set, a multidimensional context recommendation method to provide personalized event recommendations for each user in the group, and a group consensus function fusion strategy to fuse the recommendation results of the members of the group. To improve overall satisfaction with the recommendations, we propose a ranking adjustment strategy for the key context. 
Finally, we verify the effectiveness of our proposed algorithm on real data sets and find that FAER is superior to the latest algorithms in terms of global satisfaction, distance satisfaction and user fairness.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"655-668"},"PeriodicalIF":7.5,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Optimization Algorithm for Detection of Security Attacks in IoT-Enabled Cyber-Physical Systems","authors":"Amit Sagu;Nasib Singh Gill;Preeti Gulia;Ishaani Priyadarshini;Jyotir Moy Chatterjee","doi":"10.1109/TBDATA.2024.3372368","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3372368","url":null,"abstract":"The Internet of Things (IoT) is being prominently used in smart cities and a wide range of applications in society. The benefits of IoT are evident, but cyber terrorism and security concerns inhibit many organizations and users from deploying it. Cyber-physical systems that are IoT-enabled might be difficult to secure since security solutions designed for general information/operational technology systems may not work as well in an environment. Thus, deep learning (DL) can assist as a powerful tool for building IoT-enabled cyber-physical systems with automatic anomaly detection. In this paper, two distinct DL models have been employed i.e., Deep Belief Network (DBN) and Convolutional Neural Network (CNN), considered hybrid classifiers, to create a framework for detecting attacks in IoT-enabled cyber-physical systems. However, DL models need to be trained in such a way that will increase their classification accuracy. Therefore, this paper also aims to present a new hybrid optimization algorithm called “Seagull Adapted Elephant Herding Optimization” (SAEHO) to tune the weights of the hybrid classifier. The “Hybrid Classifier + SAEHO” framework takes the feature extracted dataset as an input and classifies the network as either attack or benign. Using sensitivity, precision, accuracy, and specificity, two datasets were compared. 
In every performance metric, the proposed framework outperforms conventional methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"35-46"},"PeriodicalIF":7.5,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}