Souha Al Katat;Chamseddine Zaki;Hussein Hazimeh;Ibrahim El Bitar;Rafael Angarita;Lionel Trojman
{"title":"Natural Language Processing for Arabic Sentiment Analysis: A Systematic Literature Review","authors":"Souha Al Katat;Chamseddine Zaki;Hussein Hazimeh;Ibrahim El Bitar;Rafael Angarita;Lionel Trojman","doi":"10.1109/TBDATA.2024.3366083","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366083","url":null,"abstract":"Sentiment analysis involves using computational methods to identify and classify opinions expressed in text, with the goal of determining whether the writer's stance towards a particular topic, product, or idea is positive, negative, or neutral. However, sentiment analysis in Arabic presents unique challenges due to the complexity of Arabic morphology and the variety of dialects, which make language classification even more difficult. To address these challenges, we conducted an investigation and overview of the techniques used in the last five years for embedding and classification of Arabic sentiment analysis (ASA). We collected data from 100 publications, resulting in a representative dataset of 2,300 detailed records that included attributes related to the dataset, feature extraction, approach, parameters, and performance measures. Our study aimed to identify the most powerful approaches and best model settings by analyzing the collected data to identify the significant parameters influencing performance. The results showed that Deep Learning and Machine Learning were the most commonly used techniques, followed by lexicon and transformer-based techniques. However, Deep Learning models were found to be more accurate for sentiment classification than other Machine Learning models. 
Furthermore, multi-level embedding was found to be a significant step in improving model accuracy.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"576-594"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Belén Rodríguez González;Javier Burrieza-Galán;Juan José Vinagre Díaz;Inés Peirats de Castro;Mark Richard Wilby;Oliva Garcia Cantú-Ros
{"title":"Using App Usage Data From Mobile Devices to Improve Activity-Based Travel Demand Models","authors":"Ana Belén Rodríguez González;Javier Burrieza-Galán;Juan José Vinagre Díaz;Inés Peirats de Castro;Mark Richard Wilby;Oliva Garcia Cantú-Ros","doi":"10.1109/TBDATA.2024.3366088","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366088","url":null,"abstract":"In recent years we have seen several studies showing the potential of mobile network data to reconstruct activity and mobility patterns of the population. These data sources allow continuous monitoring of the population with a higher degree of spatial and temporal resolution and at a lower cost compared with traditional methods. However, for certain applications, the spatial resolution of these data sources is still not enough since it typically provides a spatial resolution of hundreds of meters in urban areas and of a few kilometers in rural areas. In this article, we fill this gap by proposing a methodology that utilises GPS data from the usage of different applications in mobile devices. This approach improves the spatial precision in the location of activities, previously identified with the mobile network data.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"633-643"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10436340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HGN2T: A Simple but Plug-and-Play Framework Extending HGNNs on Heterogeneous Temporal Graphs","authors":"Huan Liu;Pengfei Jiao;Xuan Guo;Huaming Wu;Mengzhou Gao;Jilin Zhang","doi":"10.1109/TBDATA.2024.3366085","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366085","url":null,"abstract":"Heterogeneous graphs (HGs) with multiple entity and relation types are common in real-world networks. Heterogeneous graph neural networks (HGNNs) have shown promise for learning HG representations. However, most HGNNs are designed for static HGs and are not compatible with heterogeneous temporal graphs (HTGs). A few existing works have focused on HTG representation learning but they care more about how to capture the dynamic evolutions and less about their compatibility with those well-designed static HGNNs. They also handle graph structure and temporal dependency learning separately, ignoring that HTG evolutions are influenced by both nodes and relationships. To address this, we propose HGN2T, a simple and general framework that makes static HGNNs compatible with HTGs. HGN2T is plug-and-play, enabling static HGNNs to leverage their graph structure learning strengths. To capture the relationship-influenced evolutions, we design a special mechanism coupling both the HGNN and sequential model. Finally, through joint optimization by both detection and prediction tasks, the learned representations can fully capture temporal dependencies from historical information. 
We conduct several empirical evaluation tasks, and the results show our HGN2T can adapt static HGNNs to HTGs and outperform existing methods for HTGs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"620-632"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Cross-View Subspace Clustering via Adaptive Contrastive Learning","authors":"Zihao Zhang;Qianqian Wang;Quanxue Gao;Chengquan Pei;Wei Feng","doi":"10.1109/TBDATA.2024.3366084","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3366084","url":null,"abstract":"Cross-view subspace clustering has become a popular unsupervised method for cross-view data analysis because it can extract both the consistent and complementary features of data for different views. Nonetheless, existing methods usually ignore the discriminative features due to a lack of label supervision, which limits its further improvement in clustering performance. To address this issue, we design a novel model that leverages the self-supervision information embedded in the data itself by combining contrastive learning and self-expression learning, i.e., unsupervised cross-view subspace clustering via adaptive contrastive learning (CVCL). Specifically, CVCL employs an encoder to learn a latent subspace from the cross-view data and convert it to a consistent subspace with a self-expression layer. In this way, contrastive learning helps to provide more discriminative features for the self-expression learning layer, and the self-expression learning layer in turn supervises contrastive learning. Besides, CVCL adaptively chooses positive and negative samples for contrastive learning to reduce the noisy impact of improper negative sample pairs. Ultimately, the decoder is designed for reconstruction tasks, operating on the output of the self-expressive layer, and strives to faithfully restore the original data as much as possible, ensuring that the encoded features are potentially effective. 
Extensive experiments conducted across multiple cross-view datasets showcase the exceptional performance and superiority of our model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"609-619"},"PeriodicalIF":7.5,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ehsan Hallaji;Roozbeh Razavi-Far;Mehrdad Saif;Boyu Wang;Qiang Yang
{"title":"Decentralized Federated Learning: A Survey on Security and Privacy","authors":"Ehsan Hallaji;Roozbeh Razavi-Far;Mehrdad Saif;Boyu Wang;Qiang Yang","doi":"10.1109/TBDATA.2024.3362191","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362191","url":null,"abstract":"Federated learning has been rapidly evolving and gaining popularity in recent years due to its privacy-preserving features, among other advantages. Nevertheless, the exchange of model updates and gradients in this architecture provides new attack surfaces for malicious users of the network which may jeopardize the model performance and user and data privacy. For this reason, one of the main motivations for decentralized federated learning is to eliminate server-related threats by removing the server from the network and compensating for it through technologies such as blockchain. However, this advantage comes at the cost of challenging the system with new privacy threats. Thus, performing a thorough security analysis in this new paradigm is necessary. This survey studies possible variations of threats and adversaries in decentralized federated learning and overviews the potential defense mechanisms. Trustability and verifiability of decentralized federated learning are also considered in this study.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 2","pages":"194-213"},"PeriodicalIF":7.2,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140123489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Hypergraph Structure Learning for Multivariate Time Series Forecasting","authors":"Shun Wang;Yong Zhang;Xuanqi Lin;Yongli Hu;Qingming Huang;Baocai Yin","doi":"10.1109/TBDATA.2024.3362188","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362188","url":null,"abstract":"Multivariate time series forecasting plays an important role in many domain applications, such as air pollution forecasting and traffic forecasting. Modeling the complex dependencies among time series is a key challenging task in multivariate time series forecasting. Many previous works have used graph structures to learn inter-series correlations, which have achieved remarkable performance. However, graph networks can only capture spatio-temporal dependencies between pairs of nodes, which cannot handle high-order correlations among time series. We propose a Dynamic Hypergraph Structure Learning model (DHSL) to solve the above problems. We generate dynamic hypergraph structures from time series data using the K-Nearest Neighbors method. Then a dynamic hypergraph structure learning module is used to optimize the hypergraph structure to obtain more accurate high-order correlations among nodes. Finally, the hypergraph structures dynamically learned are used in the spatio-temporal hypergraph neural network. We conduct experiments on six real-world datasets. The prediction performance of our model surpasses existing graph network-based prediction models. 
The experimental results demonstrate the effectiveness and competitiveness of the DHSL model for multivariate time series forecasting.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"556-567"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Charline Bour;Abir Elbeji;Luigi De Giovanni;Adrian Ahne;Guy Fagherazzi
{"title":"ALTRUIST: A Python Package to Emulate a Virtual Digital Cohort Study Using Social Media Data","authors":"Charline Bour;Abir Elbeji;Luigi De Giovanni;Adrian Ahne;Guy Fagherazzi","doi":"10.1109/TBDATA.2024.3362193","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3362193","url":null,"abstract":"Epidemiological cohort studies play a crucial role in identifying risk factors for various outcomes among participants. These studies are often time-consuming and costly due to recruitment and long-term follow-up. Social media (SM) data has emerged as a valuable complementary source for digital epidemiology and health research, as online communities of patients regularly share information about their illnesses. Unlike traditional clinical questionnaires, SM offer unstructured but insightful information about patients’ disease burden. Yet, there is limited guidance on analyzing SM data as a prospective cohort. We presented the concept of virtual digital cohort studies (VDCS) as an approach to replicate cohort studies using SM data. In this paper, we introduce ALTRUIST, an open-source Python package enabling standardized generation of VDCS on SM. ALTRUIST facilitates data collection, preprocessing, and analysis steps that mimic a traditional cohort study. We provide a practical use case focusing on diabetes to illustrate the methodology. By leveraging SM data, which offers large-scale and cost-effective information on users’ health, we demonstrate the potential of VDCS as an essential tool for specific research questions. 
ALTRUIST is customizable and can be applied to data from various online communities of patients, complementing traditional epidemiological methods and promoting minimally disruptive health research.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"568-575"},"PeriodicalIF":7.5,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10420428","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Tuned Personality Federated Learning for Graph Data","authors":"Meiting Xue;Zian Zhou;Pengfei Jiao;Huijun Tang","doi":"10.1109/TBDATA.2024.3356388","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356388","url":null,"abstract":"Federated Learning (FL) empowers multiple clients to collaboratively learn a global generalization model without the need to share their local data, thus reducing privacy risks and expanding the scope of AI applications. However, current works focus less on data in a highly nonidentically distributed manner such as graph data which are common in reality, and ignore the problem of model personalization between clients for graph data training in federated learning. In this paper, we propose a novel personality graph federated learning framework based on variational graph autoencoders that incorporates model contrastive learning and local fine-tuning to achieve personalized federated training on graph data for each client, which is called FedVGAE. Then we introduce an encoder-sharing strategy to the proposed framework that shares the parameters of the encoder layer to further improve personality performance. The node classification and link prediction experiments demonstrate that our method achieves better performance than other federated learning methods on most graph datasets in the non-iid setting. 
Finally, we conduct ablation experiments, and the results demonstrate the effectiveness of our proposed method.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"313-319"},"PeriodicalIF":7.2,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dawei Dai;Yingge Liu;Yutang Li;Shiyu Fu;Shuyin Xia;Guoyin Wang
{"title":"LGRL: Local-Global Representation Learning for On-the-Fly FG-SBIR","authors":"Dawei Dai;Yingge Liu;Yutang Li;Shiyu Fu;Shuyin Xia;Guoyin Wang","doi":"10.1109/TBDATA.2024.3356393","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356393","url":null,"abstract":"On-the-fly fine-grained sketch-based image retrieval (On-the-fly FG-SBIR) frameworks aim to break the barriers that sketch drawing requires excellent skills and is time-consuming. Considering such problems, a partial sketch with fewer strokes contains only a little local information, and the drawing process may differ greatly among users, resulting in poor performance in early retrieval. In this study, we developed a local-global representation learning (LGRL) method, in which we learn the representations for both the local and global regions of the partial sketch and its target photos. Specifically, we first designed a triplet network to learn the joint embedding space shared between the local and global regions of the entire sketch and its corresponding region of the photo. Then, we divided each partial sketch in the sketch-drawing episode into several local regions; another learnable module following the triplet network was designed to learn the representations for the local regions of the partial sketch. Finally, by combining both the local and global regions of the sketches and photos, the final distance was determined. 
In the experiments, our method outperformed state-of-the-art baseline methods in terms of early retrieval efficiency on two publicly available sketch-retrieval datasets and the practice test.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"543-555"},"PeriodicalIF":7.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection","authors":"Xinxin Hu;Haotian Chen;Junjie Zhang;Hongchang Chen;Shuxin Liu;Xing Li;Yahui Wang;Xiangyang Xue","doi":"10.1109/TBDATA.2024.3352978","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352978","url":null,"abstract":"Along with the rapid evolution of mobile communication technologies, such as 5G, there has been a significant increase in telecom fraud, which severely dissipates individual fortune and social wealth. In recent years, graph mining techniques are gradually becoming a mainstream solution for detecting telecom fraud. However, the graph imbalance problem, caused by the Pareto principle, brings severe challenges to graph data mining. This emerging and complex issue has received limited attention in prior research. In this paper, we propose a Graph ATtention network with COst-sensitive BOosting (GAT-COBO) for the graph imbalance problem. First, we design a GAT-based base classifier to learn the embeddings of all nodes in the graph. Then, we feed the embeddings into a well-designed cost-sensitive learner for imbalanced learning. Next, we update the weights according to the misclassification cost to make the model focus more on the minority class. Finally, we sum the node embeddings obtained by multiple cost-sensitive learners to obtain a comprehensive node representation, which is used for the downstream anomaly detection task. Extensive experiments on two real-world telecom fraud detection datasets demonstrate that our proposed method is effective for the graph imbalance problem, outperforming the state-of-the-art GNNs and GNN-based fraud detectors. 
In addition, our model is also helpful for solving the widespread over-smoothing problem in GNNs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"528-542"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}