{"title":"An End-to-End Approach for Graph-Based Multi-View Data Clustering","authors":"Fadi Dornaika;Sally El Hajjar","doi":"10.1109/TBDATA.2024.3371357","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3371357","url":null,"abstract":"Clustering data from different sources or views is a key challenge in real-world applications. While traditional graph-based methods are effective at capturing data structures, they often require separate steps to estimate graphs of views or a consensus graph from the raw data. This reliance on intermediate steps can make these clustering methods susceptible to noisy graphs, which affects the overall performance of clustering. In response to this limitation, and with an emphasis on advocating end-to-end solutions for multi-view clustering, two comprehensive approaches are presented in this paper. Each approach starts from either the raw data or its kernelized features. The first proposal introduces a unified objective function that enables the simultaneous recovery of the graph for each view, the unified graph, the spectral projection matrices for all views, the soft cluster assignments, and the scores assigned to each view. The second proposal uses a global criterion that integrates regularization and constraints for the soft cluster assignment matrix based on the consensus graph matrix and the consensus data representation. Both proposed methods enable direct and straightforward clustering of the data without the need for additional steps. 
Extensive tests with various real-world image and text datasets confirm the superior performance of the two proposed methods.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"644-654"},"PeriodicalIF":7.5,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TgStore: An Efficient Storage System for Large Time-Evolving Graphs","authors":"Yongli Cheng;Yan Ma;Hong Jiang;Lingfang Zeng;Fang Wang;Xianghao Xu;Yuhang Wu","doi":"10.1109/TBDATA.2024.3366087","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366087","url":null,"abstract":"Existing graph systems focus mainly on the execution efficiency of the graph analysis tasks, often ignoring the importance and efficiency of time-evolving graph storage. However, to effectively mine the potential application values, an efficient storage system is important for time-evolving graphs whose storage requirement scales with the increasing number of snapshots. Storage cost and snapshot access speed are the two most important performance indicators for a time-evolving graph storage system, which are challenging for designers of such systems because they are conflicting goals. In this article, we address these challenges by proposing an efficient storage scheme for the large time-evolving graphs. We first design a Snapshot-level Data Deduplication (SLDD) strategy to eliminate the large number of repeated vertices and edges among the snapshots, and then a Structure-Changing Graph Representation (SCGR) to significantly improve the snapshot access speed. We implement an efficient time-evolving graph storage system, TgStore, based on this scheme to effectively store large-scale time-evolving graphs, aiming to efficiently support the time-evolving graph analysis tasks. Experimental results show that TgStore can obtain a high compression ratio of 43.03:1 when storing 100 snapshots of Twitter, while with an average snapshot access speedup of 16×. Efficient storage scheme enables TgStore to efficiently support time-evolving graph algorithms. 
For example, when executing the Pagerank algorithm on the time-evolving graph of Twitter, TgStore outperforms Graphone, a state-of-the-art time-evolving graph storage system, by 15.9× in algorithm execution speed and 1.45× in memory usage.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"158-173"},"PeriodicalIF":7.2,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140123521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Natural Language Processing for Arabic Sentiment Analysis: A Systematic Literature Review","authors":"Souha Al Katat;Chamseddine Zaki;Hussein Hazimeh;Ibrahim El Bitar;Rafael Angarita;Lionel Trojman","doi":"10.1109/TBDATA.2024.3366083","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366083","url":null,"abstract":"Sentiment analysis involves using computational methods to identify and classify opinions expressed in text, with the goal of determining whether the writer's stance towards a particular topic, product, or idea is positive, negative, or neutral. However, sentiment analysis in Arabic presents unique challenges due to the complexity of Arabic morphology and the variety of dialects, which make language classification even more difficult. To address these challenges, we conducted an investigation and overview of the techniques used in the last five years for embedding and classification in Arabic sentiment analysis (ASA). We collected data from 100 publications, resulting in a representative dataset of 2,300 detailed records that included attributes related to the dataset, feature extraction, approach, parameters, and performance measures. Our study aimed to identify the most powerful approaches and best model settings by analyzing the collected data to identify the significant parameters influencing performance. The results showed that Deep Learning and Machine Learning were the most commonly used techniques, followed by lexicon and transformer-based techniques. However, Deep Learning models were found to be more accurate for sentiment classification than other Machine Learning models. 
Furthermore, multi-level embedding was found to be a significant step in improving model accuracy.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"576-594"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge-DPSDG: An Edge-Based Differential Privacy Protection Model for Smart Healthcare","authors":"Moli Lyu;Zhiwei Ni;Qian Chen;Fenggang Li","doi":"10.1109/TBDATA.2024.3366071","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366071","url":null,"abstract":"The edge computing paradigm has revolutionized the healthcare sector, providing more real-time medical data processing and analysis, which also poses more serious privacy and security risks that must be carefully considered and addressed. Based on differential privacy, we present an innovative privacy-preserving model named Edge-DPSDG (Edge-Differentially Private Synthetic Data Generator) for smart healthcare under edge computing. It also develops and evolves a privacy budget allocation mechanism. In a distributed environment, the privacy budget for local medical data is personalized by computing the Shapley value and the information entropy value of each attribute in the dataset, which takes into account the trade-off between data privacy and utility. Extensive experiments on three public medical datasets are performed to evaluate the performance of Edge-DPSDG on two metrics. For utility evaluation, Edge-DPSDG shows an accuracy improvement of up to 21.29% compared to the state-of-the-art; our privacy budget allocation mechanism improved existing models’ accuracy by up to 6.05%. For privacy evaluation, Edge-DPSDG is shown to effectively ensure the privacy of the original datasets. 
In addition, Edge-DPSDG helps smooth the data, and results in a 3.99% accuracy loss decrease over the non-private model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 1","pages":"21-34"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using App Usage Data From Mobile Devices to Improve Activity-Based Travel Demand Models","authors":"Ana Belén Rodríguez González;Javier Burrieza-Galán;Juan José Vinagre Díaz;Inés Peirats de Castro;Mark Richard Wilby;Oliva Garcia Cantú-Ros","doi":"10.1109/TBDATA.2024.3366088","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366088","url":null,"abstract":"In recent years we have seen several studies showing the potential of mobile network data to reconstruct activity and mobility patterns of the population. These data sources allow continuous monitoring of the population with a higher degree of spatial and temporal resolution and at a lower cost compared with traditional methods. However, for certain applications, the spatial resolution of these data sources is still not enough since it typically provides a spatial resolution of hundreds of meters in urban areas and of a few kilometers in rural areas. In this article, we fill this gap by proposing a methodology that utilises GPS data from the usage of different applications in mobile devices. This approach improves the spatial precision in the location of activities, previously identified with the mobile network data.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"633-643"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10436340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HGN2T: A Simple but Plug-and-Play Framework Extending HGNNs on Heterogeneous Temporal Graphs","authors":"Huan Liu;Pengfei Jiao;Xuan Guo;Huaming Wu;Mengzhou Gao;Jilin Zhang","doi":"10.1109/TBDATA.2024.3366085","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366085","url":null,"abstract":"Heterogeneous graphs (HGs) with multiple entity and relation types are common in real-world networks. Heterogeneous graph neural networks (HGNNs) have shown promise for learning HG representations. However, most HGNNs are designed for static HGs and are not compatible with heterogeneous temporal graphs (HTGs). A few existing works have focused on HTG representation learning but they care more about how to capture the dynamic evolutions and less about their compatibility with those well-designed static HGNNs. They also handle graph structure and temporal dependency learning separately, ignoring that HTG evolutions are influenced by both nodes and relationships. To address this, we propose HGN2T, a simple and general framework that makes static HGNNs compatible with HTGs. HGN2T is plug-and-play, enabling static HGNNs to leverage their graph structure learning strengths. To capture the relationship-influenced evolutions, we design a special mechanism coupling both the HGNN and sequential model. Finally, through joint optimization by both detection and prediction tasks, the learned representations can fully capture temporal dependencies from historical information. 
We conduct several empirical evaluation tasks, and the results show our HGN2T can adapt static HGNNs to HTGs and outperform existing methods for HTGs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"620-632"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Cross-View Subspace Clustering via Adaptive Contrastive Learning","authors":"Zihao Zhang;Qianqian Wang;Quanxue Gao;Chengquan Pei;Wei Feng","doi":"10.1109/TBDATA.2024.3366084","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366084","url":null,"abstract":"Cross-view subspace clustering has become a popular unsupervised method for cross-view data analysis because it can extract both the consistent and complementary features of data for different views. Nonetheless, existing methods usually ignore the discriminative features due to a lack of label supervision, which limits its further improvement in clustering performance. To address this issue, we design a novel model that leverages the self-supervision information embedded in the data itself by combining contrastive learning and self-expression learning, i.e., unsupervised cross-view subspace clustering via adaptive contrastive learning (CVCL). Specifically, CVCL employs an encoder to learn a latent subspace from the cross-view data and convert it to a consistent subspace with a self-expression layer. In this way, contrastive learning helps to provide more discriminative features for the self-expression learning layer, and the self-expression learning layer in turn supervises contrastive learning. Besides, CVCL adaptively chooses positive and negative samples for contrastive learning to reduce the noisy impact of improper negative sample pairs. Ultimately, the decoder is designed for reconstruction tasks, operating on the output of the self-expressive layer, and strives to faithfully restore the original data as much as possible, ensuring that the encoded features are potentially effective. 
Extensive experiments conducted across multiple cross-view datasets showcase the exceptional performance and superiority of our model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"609-619"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Federated Learning: A Survey on Security and Privacy","authors":"Ehsan Hallaji;Roozbeh Razavi-Far;Mehrdad Saif;Boyu Wang;Qiang Yang","doi":"10.1109/TBDATA.2024.3362191","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362191","url":null,"abstract":"Federated learning has been rapidly evolving and gaining popularity in recent years due to its privacy-preserving features, among other advantages. Nevertheless, the exchange of model updates and gradients in this architecture provides new attack surfaces for malicious users of the network which may jeopardize the model performance and user and data privacy. For this reason, one of the main motivations for decentralized federated learning is to eliminate server-related threats by removing the server from the network and compensating for it through technologies such as blockchain. However, this advantage comes at the cost of challenging the system with new privacy threats. Thus, performing a thorough security analysis in this new paradigm is necessary. This survey studies possible variations of threats and adversaries in decentralized federated learning and overviews the potential defense mechanisms. Trustability and verifiability of decentralized federated learning are also considered in this study.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"194-213"},"PeriodicalIF":7.2,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140123489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Hypergraph Structure Learning for Multivariate Time Series Forecasting","authors":"Shun Wang;Yong Zhang;Xuanqi Lin;Yongli Hu;Qingming Huang;Baocai Yin","doi":"10.1109/TBDATA.2024.3362188","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362188","url":null,"abstract":"Multivariate time series forecasting plays an important role in many domain applications, such as air pollution forecasting and traffic forecasting. Modeling the complex dependencies among time series is a key challenging task in multivariate time series forecasting. Many previous works have used graph structures to learn inter-series correlations, which have achieved remarkable performance. However, graph networks can only capture spatio-temporal dependencies between pairs of nodes, which cannot handle high-order correlations among time series. We propose a Dynamic Hypergraph Structure Learning model (DHSL) to solve the above problems. We generate dynamic hypergraph structures from time series data using the K-Nearest Neighbors method. Then a dynamic hypergraph structure learning module is used to optimize the hypergraph structure to obtain more accurate high-order correlations among nodes. Finally, the hypergraph structures dynamically learned are used in the spatio-temporal hypergraph neural network. We conduct experiments on six real-world datasets. The prediction performance of our model surpasses existing graph network-based prediction models. 
The experimental results demonstrate the effectiveness and competitiveness of the DHSL model for multivariate time series forecasting.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"556-567"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ALTRUIST: A Python Package to Emulate a Virtual Digital Cohort Study Using Social Media Data","authors":"Charline Bour;Abir Elbeji;Luigi De Giovanni;Adrian Ahne;Guy Fagherazzi","doi":"10.1109/TBDATA.2024.3362193","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362193","url":null,"abstract":"Epidemiological cohort studies play a crucial role in identifying risk factors for various outcomes among participants. These studies are often time-consuming and costly due to recruitment and long-term follow-up. Social media (SM) data has emerged as a valuable complementary source for digital epidemiology and health research, as online communities of patients regularly share information about their illnesses. Unlike traditional clinical questionnaires, SM offer unstructured but insightful information about patients’ disease burden. Yet, there is limited guidance on analyzing SM data as a prospective cohort. We presented the concept of virtual digital cohort studies (VDCS) as an approach to replicate cohort studies using SM data. In this paper, we introduce ALTRUIST, an open-source Python package enabling standardized generation of VDCS on SM. ALTRUIST facilitates data collection, preprocessing, and analysis steps that mimic a traditional cohort study. We provide a practical use case focusing on diabetes to illustrate the methodology. By leveraging SM data, which offers large-scale and cost-effective information on users’ health, we demonstrate the potential of VDCS as an essential tool for specific research questions. 
ALTRUIST is customizable and can be applied to data from various online communities of patients, complementing traditional epidemiological methods and promoting minimally disruptive health research.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"568-575"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10420428","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}