{"title":"Adaptive pooling with dual-stage fusion for skeleton-based action recognition","authors":"Cong Wu , Xiao-Jun Wu , Tianyang Xu , Josef Kittler","doi":"10.1016/j.neunet.2025.107615","DOIUrl":"10.1016/j.neunet.2025.107615","url":null,"abstract":"<div><div>Pooling is essential in computer vision; however, for skeleton-based action recognition, (1) the unique structure of the skeleton limits the applicability of existing pooling strategies, and (2) the high compactness and low redundancy of the skeleton make information loss after pooling more likely to degrade accuracy. Considering these factors, in this paper, we propose an Improved Graph Pooling Network, referred to as IGPN. First, our method incorporates a region-awareness pooling strategy based on structural partitioning. Specifically, we use the correlation matrix of the original features to adaptively adjust the information weights across different regions of the newly generated feature, allowing for more flexible and effective processing. To prevent the irreversible loss of discriminative information caused by pooling, we introduce dual-stage fusion strategy that includes cross fusion module and information supplement module, which respectively complement feature-level and data-level information. As a plug-and-play structure, the proposed operation can be seamlessly integrated with existing graph conventional networks. Based on our innovations, we develop IGPN-Light, optimised for efficiency, and IGPN-Heavy, optimised for accuracy. Extensive evaluations on several challenging benchmarks demonstrate the effectiveness of our solution. For instance, in cross-subject evaluation on the NTU-RGB+D 60 dataset, IGPN-Light achieves significant accuracy improvements over the baseline while reducing FLOPs (floating-point operations per second) by 60<span><math><mo>∼</mo></math></span>70%. Meanwhile, IGPN-Heavy further boosts performance by prioritising accuracy over efficiency.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107615"},"PeriodicalIF":6.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144239871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-28, DOI: 10.1016/j.neunet.2025.107620
Dawei Song, Yuan Yuan, Xuelong Li
{"title":"Potential region attention network for RGB-D salient object detection","authors":"Dawei Song , Yuan Yuan , Xuelong Li","doi":"10.1016/j.neunet.2025.107620","DOIUrl":"10.1016/j.neunet.2025.107620","url":null,"abstract":"<div><div>Many encouraging investigations have already been conducted on RGB-D salient object detection (SOD). However, most of these methods are limited in mining single-modal features and have not fully utilized the appropriate complementarity of cross-modal features. To alleviate the issues, this study designs a potential region attention network (PRANet) for RGB-D SOD. Specifically, the PRANet adopts Swin Transformer as its backbone to efficiently obtain two-stream features. Besides, a potential multi-scale attention module (PMAM) is equipped at the highest level of the encoder, which is beneficial for mining intra-modal information and enhancing feature expression. More importantly, a potential region attention module (PRAM) is designed to properly utilize the complementarity of cross-modal information, which adopts a potential region attention to guide two-stream feature fusion. In addition, by refining and correcting cross-layer features, a feature refinement fusion module (FRFM) is designed to strengthen the cross-layer information transmission between the encoder and decoder. Finally, the multi-side supervision is used during the training phase. Sufficient experimental results on 6 RGB-D SOD datasets indicate that our PRANet has achieved outstanding performance and is superior to 15 representative methods.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107620"},"PeriodicalIF":6.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-28, DOI: 10.1016/j.neunet.2025.107623
Pingping Dong, Xiaoning Zhang
{"title":"ST-GTrans: Spatio-temporal graph transformer with road network semantic awareness for traffic flow prediction","authors":"Pingping Dong , Xiaoning Zhang","doi":"10.1016/j.neunet.2025.107623","DOIUrl":"10.1016/j.neunet.2025.107623","url":null,"abstract":"<div><div>Accurate traffic prediction has significant implications for traffic optimization and management. However, few studies have thoroughly considered the implicit spatial semantic information and intricate temporal patterns. To address these challenges, we propose a spatio-temporal graph transformer with road network semantic awareness (ST-GTrans) for traffic flow prediction, an architecture that extends the transformer to effectively model spatio-temporal dependencies in traffic data. This model incorporates a multiscale temporal transformer designed to capture historical traffic patterns across multiple time scales, enabling the identification of short- and long-term temporal dependencies. Additionally, ST-GTrans addresses spatial dependencies by separately modeling the dynamic and static traffic components. Dynamic components employ a graph transformer with an edge that captures the semantic interactions between nodes through a multi-head attention mechanism. This mechanism integrates edge features from a semantic matrix constructed using a dynamic time-warping method based on time-series traffic data. For the static components, a multi-hop graph convolutional network was used to model the spatial dependencies rooted in the road network. Finally, a generative decoder was incorporated to mitigate error accumulation in long-term predictions. Extensive experiments on diverse datasets, including the PeMS03 traffic dataset (California freeway traffic data), the Shanghai metro flow dataset, and the Hong Kong traffic dataset, validated the effectiveness of ST-GTrans in capturing complex spatio-temporal patterns and demonstrated significant improvements over state-of-the-art baseline methods across multiple metrics.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107623"},"PeriodicalIF":6.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144195255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-28, DOI: 10.1016/j.neunet.2025.107621
Guangyi Ji, Xiao Hu
{"title":"Real-world blind super-resolution using stereoscopic feature and coupled optimization","authors":"Guangyi Ji, Xiao Hu","doi":"10.1016/j.neunet.2025.107621","DOIUrl":"10.1016/j.neunet.2025.107621","url":null,"abstract":"<div><div>Blind super-resolution is typically decomposed into degradation estimation and image restoration to mitigate the ill-condition. Most existing methods employ two independent models to address these two sub-problems separately. However, independent models fail to fully account for the correlation between degradation and image, leading to incompatibilities and subsequent performance decline. Additionally, numerous algorithms leverage convolutional neural networks (CNNs) for degradation estimation, which is inadequate for capturing degradation information with global semantics. Based on the problems above, we propose a novel Coupled Optimization Strategy (COS) and a stereoscopic feature processing block. Considering the correlation between degraded images and corresponding degradation parameters, COS solves two sub-problems using a single model, enabling the problem to be optimized within a unified solution space. Meanwhile, a stereoscopic extraction structure capable of capturing local, global, and locally-global fused features is developed to efficiently implement COS and accommodate the bimodality of blind super-resolution. Extensive experiments on real and synthetic datasets validate the effectiveness of our method, yielding a 0.2 dB gain in PSNR on the DIV2KRK dataset with scale factor 2, compared to state-of-the-art algorithms.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107621"},"PeriodicalIF":6.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144195256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107610
Keisuke Kawano, Takuro Kutsuna, Keisuke Sano
{"title":"Minimal sufficient views: A DNN model making predictions with more evidence has higher accuracy","authors":"Keisuke Kawano , Takuro Kutsuna , Keisuke Sano","doi":"10.1016/j.neunet.2025.107610","DOIUrl":"10.1016/j.neunet.2025.107610","url":null,"abstract":"<div><div>Deep neural networks (DNNs) exhibit high performance in image recognition; however, the reasons for their strong generalization abilities remain unclear. A plausible hypothesis is that DNNs achieve robust and accurate predictions by identifying multiple pieces of evidence from images. Thus, to test this hypothesis, this study proposed minimal sufficient views (MSVs). MSVs is defined as a set of minimal regions within an input image that are sufficient to preserve the prediction of DNNs, thus representing the evidence discovered by the DNN. We empirically demonstrated a strong correlation between the number of MSVs (i.e., the number of pieces of evidence) and the generalization performance of the DNN models. Remarkably, this correlation was found to hold within a single DNN as well as between different DNNs, including convolutional and transformer models. This suggested that a DNN model that makes its prediction based on more evidence has a higher generalization performance. We proposed a metric based on MSVs for DNN model selection that did not require label information. Consequently, we empirically showed that the proposed metric was less dependent on the degree of overfitting, rendering it a more reliable indicator of model performance than existing metrics, such as average confidence.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107610"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107613
Huaping Zhou, Tao Wu, Kelei Sun, Jin Wu, Bin Deng, Xueseng Zhang
{"title":"HLGNet: High-Light Guided Network for low-light instance segmentation with spatial-frequency domain enhancement","authors":"Huaping Zhou , Tao Wu , Kelei Sun , Jin Wu , Bin Deng , Xueseng Zhang","doi":"10.1016/j.neunet.2025.107613","DOIUrl":"10.1016/j.neunet.2025.107613","url":null,"abstract":"<div><div>Instance segmentation models generally perform well under typical lighting conditions but struggle in low-light environments due to insufficient fine-grained detail. To address this, frequency domain enhancement has shown promise. However, the lack of spatial domain processing in existing frequency domain based methods often results in poor boundary delineation and inadequate local perception. To address these challenges, we propose HLGNet (High-Light Guided Network). By leveraging high-light image masks, our approach integrates enhancements in both the frequency and spatial domains, thereby improving the feature representation of low-light images. Specifically, we propose the SPE (Spatial-Frequency Enhancement) Block, which effectively combines and complements local spatial features with global frequency domain information. Additionally, we design the DAF (Dynamic Affine Fusion) module to inject frequency domain information into semantically significant features, thereby enhancing the model’s ability to capture both detailed target information and global semantic context. Finally, we propose the HLG Decoder, which dynamically adjusts the attention distribution by utilizing mutual information and entropy, guided by high-light image masks. This ensures improved focus on both local details and global semantics. Extensive quantitative and qualitative evaluations on two widely used low-light instance segmentation datasets demonstrate that HLGNet outperforms current state-of-the-art methods.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107613"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107581
Xuzheng Wang, Zihan Fang, Shide Du, Wenzhong Guo, Shiping Wang
{"title":"MOAL: Multi-view Out-of-distribution Awareness Learning","authors":"Xuzheng Wang, Zihan Fang, Shide Du, Wenzhong Guo, Shiping Wang","doi":"10.1016/j.neunet.2025.107581","DOIUrl":"10.1016/j.neunet.2025.107581","url":null,"abstract":"<div><div>Multi-view learning integrates data from multiple sources to enhance task performance by improving data quality. However, existing approaches primarily focus on intra-distribution data learning and consequently fail to identify out-of-distribution instances effectively. This paper introduces a method to improve the perception of out-of-distribution data in multi-view situations. First, we employ multi-view consistency and complementarity principles to develop sub-view complementarity representation learning and multi-view consistency fusion layers, thereby enhancing the model’s perception ability to typical intra-distribution features. Additionally, we introduce a specialized multi-view training loss and an agent mechanism tailored for out-of-distribution scenarios, facilitating the ability to differentiate between known and new or anomalous instances effectively. The proposed approach enhances the recognition of out-of-distribution data by improving intra-distribution feature representations and minimizing the entropy associated with out-of-distribution instances. Experimental results on multiple multi-view datasets simulating out-of-distribution scenarios confirm the effectiveness of MOLA, which consistently outperforms all baselines with average accuracy improvements of over 5%.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107581"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144184488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107609
Jiangdong Fan, Yuekeng Li, Jiayi Bi, Hui Xu, Jie Shao
{"title":"Task augmentation via channel mixture for few-task meta-learning","authors":"Jiangdong Fan , Yuekeng Li , Jiayi Bi , Hui Xu , Jie Shao","doi":"10.1016/j.neunet.2025.107609","DOIUrl":"10.1016/j.neunet.2025.107609","url":null,"abstract":"<div><div>Meta-learning is a promising approach for rapidly adapting to new tasks with minimal data by leveraging knowledge from previous tasks. However, meta-learning typically requires a large number of meta-training tasks. Existing methods often generate new tasks by interpolating fine-grained feature points, and such interpolation can compromise the continuity and integrity of the feature representations in the generated tasks. To address this problem, we propose task-level data augmentation to generate additional new tasks. Specifically, we introduce a novel task augmentation method called Task Augmentation via Channel Mixture (TACM). TACM generates new tasks by mixing channels from different tasks. This channel-level mixture preserves the continuity and integrity of feature information in channels during the mixture process, thereby enhancing the generalization ability of the model. Experimental results demonstrate that TACM outperforms other state-of-the-art methods across multiple datasets. Code is available at <span><span>https://github.com/F-GOD6/TACM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107609"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144184486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107606
Qiyue Li, Xuemei Xie, Jin Zhang, Guangming Shi
{"title":"Recognizing human–object interactions in videos with the supervision of natural language","authors":"Qiyue Li , Xuemei Xie , Jin Zhang , Guangming Shi","doi":"10.1016/j.neunet.2025.107606","DOIUrl":"10.1016/j.neunet.2025.107606","url":null,"abstract":"<div><div>Existing models for recognizing human–object interaction (HOI) in videos mainly rely on visual information for reasoning and generally treat recognition tasks as traditional multi-classification problems, where labels are represented by numbers. This supervised learning method discards semantic information in the labels and ignores advanced semantic relationships between actual categories. In fact, natural language contains a wealth of linguistic knowledge that humans have distilled about human–object interaction, and the category text contains a large amount of semantic relationships between texts. Therefore, this paper introduces human–object interaction category text features as labels and proposes a natural language supervised learning model for human–object interaction by using natural language to supervise visual feature learning to enhance visual feature expression capability. The model applies contrastive learning paradigm to human–object interaction recognition, using an image–text paired pre-training model to obtain individual image features and interaction category text features, and then using a spatial–temporal mixed module to obtain high semantic combination-based human–object interaction spatial–temporal features. Finally, the obtained visual interaction features and category text features are compared for similarity to infer the correct video human–object interaction category. The model aims to explore the semantic information in human–object interaction category label text and use a large number of image–text paired samples trained by a multi-modal pre-training model to obtain visual and textual correspondence to enhance the ability of video human–object interaction recognition. Experimental results on two human–object interaction datasets demonstrate that our method achieves the state-of-the-art performance, e.g., 93.6% and 93.1% F1 Score for Sub-activity and Affordance on CAD-120 dataset.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107606"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144169814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Networks, Pub Date: 2025-05-27, DOI: 10.1016/j.neunet.2025.107618
Peng He, Jun Yu, Liuxue Ju, Fang Gao
{"title":"Fine-grained hierarchical dynamics for image harmonization","authors":"Peng He , Jun Yu , Liuxue Ju , Fang Gao","doi":"10.1016/j.neunet.2025.107618","DOIUrl":"10.1016/j.neunet.2025.107618","url":null,"abstract":"<div><div>Image harmonization aims to generate visually consistent composite images by ensuring compatibility between the foreground and background. Existing image harmonization strategies based on the global transformation emphasize using background information for foreground normalization, potentially overlooking significant variations in appearance among regions within various scenes. Simultaneously, the coherence of local information plays a critical role in generating visually consistent images as well. To address these issues, we propose the Hierarchical Dynamics Appearance Translation (HDAT) framework, enabling a seamless transition of features and parameters from local to global views in the network and adaptive adjustments of foreground appearance based on corresponding background information. Specifically, we introduce the dynamic region-aware convolution and fine-grained mixed attention mechanism to promote the harmonious coordination of global and local details. Among them, the dynamic region-aware convolution guided by foreground masks is utilized to learn adaptive representations and correlations of foreground and background elements based on global dynamics. Meanwhile, the fine-grained mixed attention dynamically adjusts features at different channels and positions to achieve local adaptations. Furthermore, we integrate a novel multi-scale feature calibration strategy to ensure information consistency across varying scales. Extensive experiments demonstrate that our HDAT framework significantly reduces the number of network parameters while outperforming existing methods both qualitatively and quantitatively.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"190 ","pages":"Article 107618"},"PeriodicalIF":6.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144178711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}