Big Data Research: Latest Articles

Adaptive spectral GNN and frequency enhanced self-attention for traffic forecasting

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-10-09 DOI: 10.1016/j.bdr.2025.100567

Yongpeng Yang, Zhenzhen Yang

Abstract: Traffic forecasting plays a significant role in the intelligent transportation systems of smart cities. Many recent methods combine spectral graph neural networks (GNNs) with self-attention, but they still have limitations for traffic forecasting: 1) the polynomial basis of traditional spectral GNNs is fixed, which limits their ability to learn the spatial dependency of traffic data; 2) some GNNs ignore the dynamic dependency of traffic data; 3) traditional self-attention has limited perception of long-term information, time delay, and global information. These shortcomings limit the ability to capture the spatial-temporal dependency and the dynamic, heterogeneous nature of traffic data. We therefore propose an adaptive spectral GNN and frequency enhanced self-attention (ASGFES) model for traffic forecasting. First, we introduce an adaptive spectral graph neural network (ASGNN) that captures spatial dependency by constructing an adaptive polynomial basis. Two dynamic attentive graphs, one long-range and one short-range, are fed into the ASGNN to emphasize dynamics at both ranges. Second, we introduce a normalized self-attention with damped exponential moving average (NSADEMA). The normalized self-attention (NSA) retains the expressivity needed to learn all-pair interactions without extra operations such as positional encodings or multi-head operations, and it captures the temporal dependency and heterogeneity of traffic data. The DEMA built into the NSA strengthens the perception of the time-domain inductive bias of traffic data and makes the model aware of time delay. Third, a linear frequency learner with time-series decomposition (LFLTD) is developed to further improve the capture of temporal dependency and heterogeneity. Time-series decomposition (TSD) supports the analysis and forecasting of complex time series by extracting hidden components such as the trend and seasonal components, while the linear frequency learner (LFL) learns global dependencies and concentrates on the important frequency components with compact signal energy. Experiments on several public traffic datasets show that ASGFES achieves better performance than other traffic forecasting methods.
Volume 42, Article 100567.
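The abstract above names two time-domain building blocks, a damped exponential moving average and a time-series decomposition, without giving formulas. The sketch below shows one common formulation of each, purely as an illustration: the recurrence `y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}` and the moving-average trend split are assumptions, and the function names `damped_ema` and `decompose` are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple


def damped_ema(x: List[float], alpha: float, delta: float) -> List[float]:
    """Damped EMA (assumed form): y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}.

    The initial state y_{-1} is taken to be 0.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("alpha and delta are assumed to lie in (0, 1)")
    smoothed = []
    previous = 0.0
    for value in x:
        previous = alpha * value + (1.0 - alpha * delta) * previous
        smoothed.append(previous)
    return smoothed


def decompose(x: List[float], window: int) -> Tuple[List[float], List[float]]:
    """Split a series into a moving-average trend and the remainder.

    The remainder plays the role of the seasonal component in this sketch.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window is assumed to be a positive odd integer")
    if not x:
        return [], []
    half = window // 2
    # Replicate the end points so the trend has the same length as the input.
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    trend = [sum(padded[i:i + window]) / window for i in range(len(x))]
    seasonal = [value - t for value, t in zip(x, trend)]
    return trend, seasonal
```

With `delta` close to 1 the recurrence behaves like a plain EMA; smaller values of `delta` let past inputs decay more slowly, which is one way to read the "time delay" awareness the abstract mentions.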

Citations: 0

A decentralized metaheuristic approach to feature selection inspired by social interactions within a societal framework, for handling datasets of diverse sizes

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-08-28 DOI: 10.1016/j.bdr.2025.100556

Sobia Tariq Javed, Kashif Zafar, Irfan Younas

Abstract: The rapid advancement of technology has led to the generation of big data. This vast and diverse data can reveal valuable patterns and yield promising results when effectively mined, processed, and analyzed. However, it also introduces the “curse of dimensionality,” which can degrade the performance of machine learning models. Feature selection (FS) is a data preprocessing technique that aims to identify the optimal feature set in order to improve model efficiency and reduce processing time. Numerous metaheuristic wrapper-based FS techniques have been explored in the literature. A significant drawback of many of these algorithms is their dependence on centralized learning, in which the global best solution drives the search direction. This is risky: any error by the global best can hinder the exploration and exploitation of other promising regions and lead to inaccuracies in discovering the true global optimum. This paper proposes the Binary Kids Learning Optimization Algorithm (BKLO), the binary variant of a novel decentralized metaheuristic, the Kids Learning Optimization Algorithm (KLO), for optimal feature selection for classification in wrapper mode. The continuous solutions of KLO are converted to binary space with a transfer function, and two transfer functions are compared: the hyperbolic tangent (V-shaped) and the sigmoid (S-shaped). BKLO is compared with seven state-of-the-art algorithms. Performance is evaluated with several assessment indicators on fifteen benchmark datasets spanning a wide range of dimensions (small, medium, and large) from the University of California Irvine (UCI) repository and Arizona State University. Experiments and Friedman's Mean Rank (FMR) statistical tests demonstrate the superiority of BKLO over the competing algorithms in reducing the number of features while increasing classification accuracy.
Volume 41, Article 100556.
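The abstract above says that continuous KLO solutions are mapped to binary feature masks with an S-shaped (sigmoid) or V-shaped (hyperbolic tangent) transfer function. A minimal sketch of the usual update rules from the binary-metaheuristic literature follows; it is an assumption that BKLO applies them in this standard way, and the function names are hypothetical.

```python
import math
import random
from typing import List


def s_shaped(x: float) -> float:
    """Sigmoid transfer function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def v_shaped(x: float) -> float:
    """V-shaped transfer function |tanh(x)|."""
    return abs(math.tanh(x))


def binarize_s(position: List[float], rng: random.Random) -> List[int]:
    """S-shaped rule: the transfer value is the probability that the bit is 1."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarize_v(position: List[float], bits: List[int], rng: random.Random) -> List[int]:
    """V-shaped rule: the transfer value is the probability of flipping the current bit."""
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, bits)]
```

The practical difference is that the S-shaped rule draws each bit afresh from the position value, whereas the V-shaped rule keeps the current feature mask unless the magnitude of the position is large.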

Citations: 0

Compression of big data collected in wind farm based on tensor train decomposition

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-08-20 DOI: 10.1016/j.bdr.2025.100554

Keren Li, Wenqiang Zhang, Dandan Xiao, Peng Hou, Shuai Yan, Yang Wang, Xuerui Mao

Citations: 0

Explainable malware detection through integrated graph reduction and learning techniques

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-08-19 DOI: 10.1016/j.bdr.2025.100555

Hesamodin Mohammadian, Griffin Higgins, Samuel Ansong, Roozbeh Razavi-Far, Ali A. Ghorbani

Abstract: Control Flow Graphs and Function Call Graphs have recently gained attention in malware detection because they can represent the complex structural and functional behavior of programs. To make better use of these representations and improve detection performance, they have been paired with Graph Neural Networks (GNNs). However, the sheer size and complexity of these graph representations pose a significant challenge for researchers. At the same time, the simple binary classification provided by GNN models is insufficient for malware analysts. To address these challenges, this paper integrates novel graph reduction techniques and GNN explainability into a malware detection framework to enhance both efficiency and interpretability. Through extensive evaluation, we demonstrate that the proposed graph reduction technique significantly reduces the size and complexity of the input graphs while maintaining detection performance. Furthermore, the important subgraphs extracted with GNNExplainer provide better insight into the model's decisions and help security experts with further analysis.
Volume 41, Article 100555.

Citations: 0

NGLinker: Link prediction for node featureless networks

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-08-18 DOI: 10.1016/j.bdr.2025.100558

Yong Li, Jingpeng Wu, Zhongying Zhang

Abstract: Link prediction is a paradigmatic problem in network science with many real-world applications. It aims to infer missing or future links from the currently observed partial nodes and links. Conventional link prediction models are based on network structure; they have relatively low prediction accuracy and lack universality and scalability. The performance of link prediction based on machine learning with hand-crafted features is strongly influenced by subjective judgment. Although graph embedding learning (GEL) models can avoid these shortcomings, they still pose some challenges. Because GEL models are generally based on random walks and graph neural networks (GNNs), their prediction accuracy is relatively low, making them unsuitable for revealing hidden information in node featureless networks. To address these challenges, we present NGLinker, a new link prediction model based on Node2vec and GraphSage that can reconcile performance and accuracy in a node featureless network. Rather than learning node features with label information, NGLinker depends only on the local network structure. Quantitatively, we observe superior prediction accuracy of NGLinker and lab test imputations compared to state-of-the-art models on three public networks and one private network, which strongly supports that link prediction with NGLinker is feasible and effective. NGLinker not only achieves high prediction accuracy in terms of precision and area under the receiver operating characteristic curve (AUC) but also shows strong universality and scalability. The NGLinker model extends the application of GNNs to node featureless networks.
Volume 41, Article 100558.
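The abstract above states that NGLinker builds on Node2vec and GraphSage and relies only on local network structure, but it does not say how the two are combined. As an illustration of the structure-only ingredient, the sketch below implements the standard second-order biased random walk of node2vec with return parameter `p` and in-out parameter `q`. The function name, the adjacency-dict graph type, and the idea that such walks would feed a skip-gram model whose embeddings serve as node inputs to a GNN are assumptions, not details from the paper.

```python
import random
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # undirected graph as an adjacency dict (assumed format)


def node2vec_walk(graph: Graph, start: int, length: int,
                  p: float, q: float, rng: random.Random) -> List[int]:
    """Generate one biased random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end: isolated node
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)       # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)   # move outward, to distance 2
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

A large `q` keeps walks close to the start node, while a small `q` pushes them outward, so the two parameters trade off local against more global structural context.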

Citations: 0

Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN

IF 4.2, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-08-13 DOI: 10.1016/j.bdr.2025.100557

Luping Zhi, Wanmin Wang

Abstract: Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.
Volume 41, Article 100557.
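The abstract above reports precision and the Matthews correlation coefficient (MCC) alongside AUC and AUPRC. As a worked illustration of why MCC is informative under severe class imbalance, the sketch below computes both metrics from a confusion matrix. The counts are hypothetical and are not taken from the paper.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted frauds that are real frauds."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all transactions classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical confusion matrix: 10,000 transactions, 100 of them fraudulent.
TP, TN, FP, FN = 90, 9890, 10, 10
print(f"accuracy  = {accuracy(TP, TN, FP, FN):.4f}")
print(f"precision = {precision(TP, FP):.4f}")
print(f"MCC       = {mcc(TP, TN, FP, FN):.4f}")
```

In this example accuracy is dominated by the 9,900 legitimate transactions, whereas MCC uses all four cells of the confusion matrix and therefore drops noticeably when either error type grows.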

Citations: 0

Deep neural network modeling for financial time series analysis

IF 3.5, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-06-09 DOI: 10.1016/j.bdr.2025.100553

Zheng Fang, Toby Cai

Citations: 0

Time-synchronized sentiment labeling via autonomous online comments data mining: A multimodal information fusion on large-scale multimedia data

IF 3.5, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-06-08 DOI: 10.1016/j.bdr.2025.100552

Jiachen Ma, Nazmus Sakib, Fahim Islam Anik, Sheikh Iqbal Ahamed

Abstract: While temporal sentiment labels prove invaluable for video tagging, segmentation, and labeling tasks in multimedia studies, large-scale manual annotation remains cost- and time-prohibitive. Emerging Online Time-Sync Comment (TSC) datasets offer promising alternatives for generating sentiment maps. However, limitations in existing TSC scope and a lack of resource-constrained data creation guidelines hinder broader use. This study addresses these challenges by proposing a novel system for automated TSC generation utilizing recent YouTube comments as a readily accessible source of time-synchronized data. The efficacy of our multi-platform data mining system is evaluated through extensive long-term trials, leading to the development and analysis of two large-scale TSC datasets. Benchmarking against original temporal Automatic Speech Recognition (ASR) sentiment annotations validates the accuracy of our generated data. This work establishes a promising method for automatic TSC generation, laying the groundwork for further advancements in multimedia research and paving the way for novel sentiment analysis applications.
Volume 41, Article 100552.

Citations: 0

Development of an integrated data system for regional tourism analysis in Italy: A microdata perspective

IF 3.5, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-06-07 DOI: 10.1016/j.bdr.2025.100550

Samuele Cesarini, Fabrizio Antolini, Ivan Terraglia

Citations: 0

BETM: A new pre-trained BERT-guided embedding-based topic model

IF 3.5, Zone 3 (Computer Science)

Big Data Research Pub Date: 2025-06-06 DOI: 10.1016/j.bdr.2025.100551

Yang Liu, Xiaotang Zhou, Zhenwei Zhang, Xiran Yang

Abstract: Topic models and pre-trained BERT are increasingly widely used in Natural Language Processing (NLP), but there is no standard method for combining them. In this paper, we propose a new pre-trained BERT-guided Embedding-based Topic Model (BETM). Through constraints on the topic-word and document-topic distributions, BETM learns semantic, syntactic, and topic information from BERT embeddings. In addition, we design two solutions in BETM: one for the insufficient contextual information caused by short input, and one for the semantic truncation caused by long input. We find that the word embeddings of BETM are more suitable for topic modeling than pre-trained GloVe word embeddings, and that BETM can flexibly select different variants of pre-trained BERT for specific datasets to obtain better topic quality. We also find that BETM handles large and heavy-tailed vocabularies well, even when they contain stop words. BETM obtains state-of-the-art (SOTA) results on several benchmark datasets: Yelp Review Polarity (106,586 samples), Wiki Text 103 (71,533 samples), Open-Web-Text (35,713 samples), 20Newsgroups (10,899 samples), and AG-news (127,588 samples).
Volume 41, Article 100551.
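The abstract above refers to a topic-word distribution in an embedding-based topic model. In the general embedded-topic-model formulation, the distribution of topic k over the vocabulary is a softmax of inner products between word embeddings and a topic embedding. The sketch below shows that formulation only; whether BETM uses exactly this parameterization is not stated in the abstract, and the toy vectors and function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_embeddings: List[List[float]],
                            topic_embedding: List[float]) -> List[float]:
    """beta_k = softmax(rho @ alpha_k): one probability per vocabulary word."""
    scores = [sum(w * t for w, t in zip(vector, topic_embedding))
              for vector in word_embeddings]
    return softmax(scores)


# Toy vocabulary of three words with 2-dimensional embeddings (made-up values).
rho = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha_k = [2.0, 0.0]
beta_k = topic_word_distribution(rho, alpha_k)
```

Words whose embeddings point in the same direction as the topic embedding receive more probability mass, which is how embeddings from a pre-trained encoder could shape topics in a model of this family.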

Citations: 0