{"title":"Unsupervised Cross-View Subspace Clustering via Adaptive Contrastive Learning","authors":"Zihao Zhang;Qianqian Wang;Quanxue Gao;Chengquan Pei;Wei Feng","doi":"10.1109/TBDATA.2024.3366084","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366084","url":null,"abstract":"Cross-view subspace clustering has become a popular unsupervised method for cross-view data analysis because it can extract both the consistent and complementary features of data for different views. Nonetheless, existing methods usually ignore the discriminative features due to a lack of label supervision, which limits its further improvement in clustering performance. To address this issue, we design a novel model that leverages the self-supervision information embedded in the data itself by combining contrastive learning and self-expression learning, i.e., unsupervised cross-view subspace clustering via adaptive contrastive learning (CVCL). Specifically, CVCL employs an encoder to learn a latent subspace from the cross-view data and convert it to a consistent subspace with a self-expression layer. In this way, contrastive learning helps to provide more discriminative features for the self-expression learning layer, and the self-expression learning layer in turn supervises contrastive learning. Besides, CVCL adaptively chooses positive and negative samples for contrastive learning to reduce the noisy impact of improper negative sample pairs. Ultimately, the decoder is designed for reconstruction tasks, operating on the output of the self-expressive layer, and strives to faithfully restore the original data as much as possible, ensuring that the encoded features are potentially effective. 
Extensive experiments conducted across multiple cross-view datasets showcase the exceptional performance and superiority of our model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"609-619"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ConDTC: Contrastive Deep Trajectory Clustering for Fine-Grained Mobility Pattern Mining","authors":"Junjun Si;Jin Yang;Yang Xiang;Li Li;Bo Tu;Rongqing Zhang","doi":"10.1109/TBDATA.2024.3362195","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362195","url":null,"abstract":"Trajectory clustering is a cornerstone task in the field of trajectory mining. With the proliferation of deep learning, deep trajectory clustering has been widely researched to mine mobility patterns from massive unlabeled trajectories. Nevertheless, existing methods mostly ignore trajectories’ temporal regularities, which are essential for mining fine-grained mobility patterns for applications including traveling group identification, transportation mode discovering, social security emergency, etc. To fill this gap, we propose ConDTC, a contrastive deep trajectory clustering method targeting for fine-grained mobility pattern mining. Specifically, we first design a spatial-temporal trajectory representation learning method which can capture both spatial and temporal regularities of trajectories synchronously. The proposed trajectory representation model can be used as a pre-trained model to serve various downstream trajectory mining tasks. Then, we construct a contrastive trajectory clustering module which optimizes trajectory representations and clustering performance simultaneously. Experimental results on three datasets validate that ConDTC can identify fine-grained mobility patterns by clustering trajectories with similar spatial-temporal mobility patterns together while separating those with different mobility patterns apart. 
Actually, ConDTC outperforms all state-of-the-art competitors substantially in terms of effectiveness, efficiency and robustness.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 2","pages":"333-344"},"PeriodicalIF":7.5,"publicationDate":"2024-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Federated Learning: A Survey on Security and Privacy","authors":"Ehsan Hallaji;Roozbeh Razavi-Far;Mehrdad Saif;Boyu Wang;Qiang Yang","doi":"10.1109/TBDATA.2024.3362191","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362191","url":null,"abstract":"Federated learning has been rapidly evolving and gaining popularity in recent years due to its privacy-preserving features, among other advantages. Nevertheless, the exchange of model updates and gradients in this architecture provides new attack surfaces for malicious users of the network which may jeopardize the model performance and user and data privacy. For this reason, one of the main motivations for decentralized federated learning is to eliminate server-related threats by removing the server from the network and compensating for it through technologies such as blockchain. However, this advantage comes at the cost of challenging the system with new privacy threats. Thus, performing a thorough security analysis in this new paradigm is necessary. This survey studies possible variations of threats and adversaries in decentralized federated learning and overviews the potential defense mechanisms. Trustability and verifiability of decentralized federated learning are also considered in this study.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"194-213"},"PeriodicalIF":7.2,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140123489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Hypergraph Structure Learning for Multivariate Time Series Forecasting","authors":"Shun Wang;Yong Zhang;Xuanqi Lin;Yongli Hu;Qingming Huang;Baocai Yin","doi":"10.1109/TBDATA.2024.3362188","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362188","url":null,"abstract":"Multivariate time series forecasting plays an important role in many domain applications, such as air pollution forecasting and traffic forecasting. Modeling the complex dependencies among time series is a key challenging task in multivariate time series forecasting. Many previous works have used graph structures to learn inter-series correlations, which have achieved remarkable performance. However, graph networks can only capture spatio-temporal dependencies between pairs of nodes, which cannot handle high-order correlations among time series. We propose a Dynamic Hypergraph Structure Learning model (DHSL) to solve the above problems. We generate dynamic hypergraph structures from time series data using the K-Nearest Neighbors method. Then a dynamic hypergraph structure learning module is used to optimize the hypergraph structure to obtain more accurate high-order correlations among nodes. Finally, the hypergraph structures dynamically learned are used in the spatio-temporal hypergraph neural network. We conduct experiments on six real-world datasets. The prediction performance of our model surpasses existing graph network-based prediction models. 
The experimental results demonstrate the effectiveness and competitiveness of the DHSL model for multivariate time series forecasting.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"556-567"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALTRUIST: A Python Package to Emulate a Virtual Digital Cohort Study Using Social Media Data","authors":"Charline Bour;Abir Elbeji;Luigi De Giovanni;Adrian Ahne;Guy Fagherazzi","doi":"10.1109/TBDATA.2024.3362193","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362193","url":null,"abstract":"Epidemiological cohort studies play a crucial role in identifying risk factors for various outcomes among participants. These studies are often time-consuming and costly due to recruitment and long-term follow-up. Social media (SM) data has emerged as a valuable complementary source for digital epidemiology and health research, as online communities of patients regularly share information about their illnesses. Unlike traditional clinical questionnaires, SM offer unstructured but insightful information about patients’ disease burden. Yet, there is limited guidance on analyzing SM data as a prospective cohort. We presented the concept of virtual digital cohort studies (VDCS) as an approach to replicate cohort studies using SM data. In this paper, we introduce ALTRUIST, an open-source Python package enabling standardized generation of VDCS on SM. ALTRUIST facilitates data collection, preprocessing, and analysis steps that mimic a traditional cohort study. We provide a practical use case focusing on diabetes to illustrate the methodology. By leveraging SM data, which offers large-scale and cost-effective information on users’ health, we demonstrate the potential of VDCS as an essential tool for specific research questions. 
ALTRUIST is customizable and can be applied to data from various online communities of patients, complementing traditional epidemiological methods and promoting minimally disruptive health research.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"568-575"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10420428","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Tuned Personality Federated Learning for Graph Data","authors":"Meiting Xue;Zian Zhou;Pengfei Jiao;Huijun Tang","doi":"10.1109/TBDATA.2024.3356388","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356388","url":null,"abstract":"Federated Learning (FL) empowers multiple clients to collaboratively learn a global generalization model without the need to share their local data, thus reducing privacy risks and expanding the scope of AI applications. However, current works focus less on data in a highly nonidentically distributed manner such as graph data which are common in reality, and ignore the problem of model personalization between clients for graph data training in federated learning. In this paper, we propose a novel personality graph federated learning framework based on variational graph autoencoders that incorporates model contrastive learning and local fine-tuning to achieve personalized federated training on graph data for each client, which is called FedVGAE. Then we introduce an encoder-sharing strategy to the proposed framework that shares the parameters of the encoder layer to further improve personality performance. The node classification and link prediction experiments demonstrate that our method achieves better performance than other federated learning methods on most graph datasets in the non-iid setting. 
Finally, we conduct ablation experiments, and the results demonstrate the effectiveness of our proposed method.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"313-319"},"PeriodicalIF":7.2,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LGRL: Local-Global Representation Learning for On-the-Fly FG-SBIR","authors":"Dawei Dai;Yingge Liu;Yutang Li;Shiyu Fu;Shuyin Xia;Guoyin Wang","doi":"10.1109/TBDATA.2024.3356393","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356393","url":null,"abstract":"On-the-fly fine-grained sketch-based image retrieval (On-the-fly FG-SBIR) frameworks aim to break the barriers that sketch drawing requires excellent skills and is time-consuming. However, a partial sketch with fewer strokes contains only a little local information, and the drawing process may differ greatly among users, resulting in poor performance in early retrieval. In this study, we developed a local-global representation learning (LGRL) method, in which we learn the representations for both the local and global regions of the partial sketch and its target photos. Specifically, we first designed a triplet network to learn the joint embedding space shared between the local and global regions of the entire sketch and its corresponding region of the photo. Then, we divided each partial sketch in the sketch-drawing episode into several local regions. Another learnable module following the triplet network was designed to learn the representations for the local regions of the partial sketch. Finally, by combining both the local and global regions of the sketches and photos, the final distance was determined. 
In the experiments, our method outperformed state-of-the-art baseline methods in terms of early retrieval efficiency on two publicly available sketch-retrieval datasets and the practice test.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"543-555"},"PeriodicalIF":7.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection","authors":"Xinxin Hu;Haotian Chen;Junjie Zhang;Hongchang Chen;Shuxin Liu;Xing Li;Yahui Wang;Xiangyang Xue","doi":"10.1109/TBDATA.2024.3352978","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352978","url":null,"abstract":"Along with the rapid evolution of mobile communication technologies, such as 5G, there has been a significant increase in telecom fraud, which severely dissipates individual fortune and social wealth. In recent years, graph mining techniques are gradually becoming a mainstream solution for detecting telecom fraud. However, the graph imbalance problem, caused by the Pareto principle, brings severe challenges to graph data mining. This emerging and complex issue has received limited attention in prior research. In this paper, we propose a Graph ATtention network with COst-sensitive BOosting (GAT-COBO) for the graph imbalance problem. First, we design a GAT-based base classifier to learn the embeddings of all nodes in the graph. Then, we feed the embeddings into a well-designed cost-sensitive learner for imbalanced learning. Next, we update the weights according to the misclassification cost to make the model focus more on the minority class. Finally, we sum the node embeddings obtained by multiple cost-sensitive learners to obtain a comprehensive node representation, which is used for the downstream anomaly detection task. Extensive experiments on two real-world telecom fraud detection datasets demonstrate that our proposed method is effective for the graph imbalance problem, outperforming the state-of-the-art GNNs and GNN-based fraud detectors. 
In addition, our model is also helpful for solving the widespread over-smoothing problem in GNNs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"528-542"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Core Maintenance on Dynamic Graphs: A Distributed Approach Built on H-Index","authors":"Qiang-Sheng Hua;Hongen Wang;Hai Jin;Xuanhua Shi","doi":"10.1109/TBDATA.2024.3352973","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352973","url":null,"abstract":"Core number is an essential tool for analyzing graph structure. Graphs in the real world are typically large and dynamic, requiring the development of distributed algorithms to refrain from expensive I/O operations and the maintenance algorithms to address dynamism. Core maintenance updates the core number of each vertex upon the insertion/deletion of vertices/edges. Although the state-of-the-art distributed maintenance algorithm (Weng et al. 2022) can handle multiple edge insertions/deletions simultaneously, it still has two aspects to improve. (I) Parallel processing is not allowed when inserting/removing edges with the same core number, reducing the degree of parallelism and raising the number of rounds. (II) During the implementation phase, only one thread is assigned to the vertices with the same core number, leading to the inability to fully utilize the distributed computing power. Furthermore, the h-index (Lü et al. 2016) based distributed core decomposition algorithm (Montresor et al. 2013) can fully utilize the distributed computing power where all vertices can be processed in parallel. However, it requires all vertices to recompute their core numbers upon graph changes. In this article, we propose a distributed core maintenance algorithm based on h-index, which circumvents the issues of algorithm (Weng et al. 2022). In addition, our algorithm avoids core numbers recalculation where the numbers do not change. In comparison to the state-of-the-art distributed maintenance algorithm (Weng et al. 2022), the time speedup ratio is at least 100 in the scenarios of both insertion and deletion. Compared to the distributed core decomposition algorithm (Montresor et al. 
2013), the average time speedup ratios are 2 and 8 for the cases of insertion and deletion, respectively.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"595-608"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10388383","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Heterogeneous Streaming Feature Selection Without Feature Type Information","authors":"Peng Zhou;Yunyun Zhang;Zhaolong Ling;Yuanting Yan;Shu Zhao;Xindong Wu","doi":"10.1109/TBDATA.2024.3350630","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350630","url":null,"abstract":"Feature selection aims to select an optimal minimal feature subset from the original datasets and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of Big Data. However, features may be generated dynamically and arrive individually over time in practice, which we call streaming features. Most existing streaming feature selection methods assume that all dynamically generated features are the same type or assume we can know the feature type for each new arriving feature in advance, but this is unreasonable and unrealistic. Therefore, this paper first studies a practical issue of Online Heterogeneous Streaming Feature Selection without the feature type information before learning, named OHSFS. Specifically, we first model the streaming feature selection issue as a minimax problem. Then, in terms of MIC (Maximal Information Coefficient), we derive a new metric $MIC_{Gain}$ to determine whether a new streaming feature should be selected. To speed up the efficiency of OHSFS, we present the metric $MIC_{Cor}$ that can directly discard low correlation features. Finally, extensive experimental results indicate the effectiveness of OHSFS. 
Moreover, OHSFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"470-485"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}