MP-YOLO: multidimensional feature fusion based layer adaptive pruning YOLO for dense vehicle object detection algorithm
Wanzhen Zhou, Junjie Wang, Xi Meng, Jianxia Wang, Yufei Song, Zhiguo Liu
Journal of Visual Communication and Image Representation, vol. 112, Article 104560 (2025-08-18). DOI: 10.1016/j.jvcir.2025.104560
Abstract: In recent years, artificial intelligence has been applied to the research and development of autonomous vehicles. However, the high energy consumption of AI models and the high accuracy required for object detection in autonomous driving have held back that development. To alleviate these problems, we optimize YOLOv8 and propose a lightweight vehicle object detection algorithm, MP-YOLO (Multidimensional feature fusion and layer adaptive Pruning YOLO), which suits edge devices with limited storage while meeting detection accuracy requirements. First, two multi-scale feature fusion modules, MSFB and HFF, are proposed to merge features of different dimensions and enhance the model's feature learning capability. Second, a detection head at the 160×160 scale is added to improve small-object detection. Third, the WIoU loss replaces the original CIoU loss in YOLOv8 to address the high overlap among road objects. Lastly, the Layer-Adaptive Sparsity for Magnitude-based Pruning (LAMP) method is applied to significantly reduce model size. MP-YOLO was evaluated on the recent autonomous driving dataset DAIR-V2X; it outperforms the original model with improvements of 4.7% in AP50 and 4.2% in AP, while the model size shrinks from 6 MB to 2.2 MB. It surpasses other classical detection models in both size and accuracy and meets the requirements for deployment on edge devices. The source code is available at https://github.com/Wang-jj-zs/MP-YOLO/tree/master.
{"title":"Res2former: A multi-scale fusion based transformer feature extraction method","authors":"Bojun Xie, Yanjie Wang, Shaocong Guo, Junfen Chen","doi":"10.1016/j.jvcir.2025.104546","DOIUrl":"10.1016/j.jvcir.2025.104546","url":null,"abstract":"<div><div>In this paper, we propose Res2former, a novel lightweight hybrid architecture that combines convolutional neural networks (CNNs) and Transformers to effectively model both local and global dependencies in visual data. While Vision Transformer (ViT) demonstrates strong global modeling capability, it lack locality and translation-invariance, making it reliant on large-scale datasets and computational resources. To address this, Res2former adopts a stage-wise hybrid design: in shallow layers, CNNs replace Transformer blocks to exploit local inductive biases and reduce early computational cost; in deeper layers, we introduce a multi-scale fusion mechanism by embedding multiple parallel convolutional kernels of varying receptive fields into the Transformer’s MLP structure. This enables Res2former to capture multi-scale visual semantics more effectively and fuse features across different scales. Experimental results reveal that with the same parameters and computational complexity, Res2former outperforms variants of Transformer and CNN models in image classification (80.7 top-1 accuracy on ImageNet-1K), object detection (45.8 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>b</mi><mi>o</mi><mi>x</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set), and instance segmentation (41.0 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>m</mi><mi>a</mi><mi>s</mi><mi>k</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set) tasks. The code is publicly accessible at <span><span>https://github.com/hand-Max/Res2former</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104546"},"PeriodicalIF":3.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-dimensional normalized knowledge distillation leveraging class relations","authors":"Benhong Zhang, Yiren Song, Yidong Zhang, Xiang Bi","doi":"10.1016/j.jvcir.2025.104557","DOIUrl":"10.1016/j.jvcir.2025.104557","url":null,"abstract":"<div><div>Knowledge distillation (KD) as one of the important methods of model compression, has been widely used in tasks such as image classification and detection. Existing KD methods are mainly carried out at the instance level and often ignore the role of inter-class relational information. Additionally, when there is a significant gap between the student’s capacity and the teacher’s capacity, the two model cannot be matched precisely. To address these issues, this paper proposes a two-dimensional normalized knowledge distillation method, which integrates inter-class and intra-class correlations and rectifies logits in two dimensions. Through our approach, the student model is able to acquire contextual information between samples with the help of intra-class correlation and mitigate the effect of logits magnitude on the prediction results through normalized rectification. We conduct numerous experiments and results show that our method achieves higher accuracy and better training efficiency compared to traditional KD methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104557"},"PeriodicalIF":3.1,"publicationDate":"2025-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144911532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salient object detection enhanced pseudo-labels for weakly supervised semantic segmentation
Yunping Zheng, Zhou Jiang, Shiqiang Shu, Yuze Zhu, Zejun Wang, Mudar Sarem
Journal of Visual Communication and Image Representation, vol. 111, Article 104548 (2025-08-14). DOI: 10.1016/j.jvcir.2025.104548
Abstract: To address the limitations of pseudo-labels generated from Class Activation Maps (CAM) in weakly supervised semantic segmentation, we propose a novel salient object fusion framework. The framework complements CAM localization by capturing the complete contours and edge details of salient targets through our proposed RGB-SOD network. We also design a saliency object selector that dynamically balances the weights of CAM and Salient Object Detection (SOD) when generating single-class pseudo-labels, further improving their quality. Despite its simplicity, our method achieves competitive performance of 77.52% and 77.73% on the PASCAL VOC 2012 validation and test sets, respectively, pushing past the performance bottleneck of state-of-the-art methods. This work highlights the importance of effectively integrating complementary information for weakly supervised segmentation. Our source code is publicly available at https://github.com/UGVly/SOD-For-WSSS.git.
{"title":"FocusTrack: Enhancing object detection and tracking for small and ambiguous objects","authors":"Said Baz Jahfar Khan , Chuanyue Li , Peng Zhang","doi":"10.1016/j.jvcir.2025.104549","DOIUrl":"10.1016/j.jvcir.2025.104549","url":null,"abstract":"<div><div>Multi-object tracking (MOT) is an essential task in computer vision, but it still faces significant challenges in real-world applications, especially with small, ambiguous, and occluded objects in crowded environments. The research study introduces FocusTrack, an innovative and robust one-stage multi-object tracking system to improve object detection and trajectory association in challenging conditions. FocusTrack initiates by fine-tuning YOLOv10, a modern high-performance detector, across many datasets (MOT17, MOT20, CityPersons, ETHZ, and CrowdHuman). We use copy-paste augmentation on essential training datasets to improve the detection of small and distant objects, therefore significantly improving performance in intricate visual environments.</div><div>To ensure precise and consistent tracking, FocusTrack introduces several vital modules: Modified Soft Buffered IoU (MS-BIoU) for adaptive IoU matching dependent on object sizes and detection confidence; Adaptive Similarity Enhancement (ASE) for the improvement of similarity matrices through occlusion-aware, motion-scaled, and size-weighted adjustments; and Spatial-Temporal Confidence Enhancement (STCE) to dynamically improve detection confidence using spatial overlap, motion patterns, and crowd density. Furthermore, our Track Recovery and Association Refinement (TRAR) module recovers missing objects via adaptive re-association techniques, while SV-Link ensures motion-aware, occlusion-resistant associations, and SOTS improves trajectories using Gaussian Process Regression specific for object dimensions and occlusion intensity.</div><div>After evaluation using the challenging MOT17 and MOT20 benchmarks, FocusTrack achieves HOTA scores of 66.91 and 66.5, MOTA scores of 82.32 and 77.9, and IDF1 scores of 82.96 and 82.1, respectively—exceeding other leading online trackers such as BoostTrack++ and BoT-SORT. The results confirm FocusTrack as a very efficient, real-time MOT framework, especially successful at handling complex and crowded environments with small or partially hidden objects.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104549"},"PeriodicalIF":3.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Vision-Language Prompt Learners for Learning with Noisy Labels","authors":"Changhui Hu , Bhalaji Nagarajan , Ricardo Marques , Petia Radeva","doi":"10.1016/j.jvcir.2025.104550","DOIUrl":"10.1016/j.jvcir.2025.104550","url":null,"abstract":"<div><div>Training deep learning models requires manual labelling of a large volume of diverse data that is a tedious and time-consuming process. As humans are prone to errors, large-scale data labelling often introduces label noise, leading to degradation in the performance of deep neural networks. Recently, pre-trained models on extensive multi-modal data have shown remarkable performance in computer vision tasks. However, their use to tackle the problem of learning with noisy labels is still in its infancy, due to high computational complexity and training costs. In this work, we propose a novel approach, AVL-Prompter, to effectively leverage vision-language-pre-trained models for learning with noisy labels. The key idea of our method is the use of shared deep learnable prompts between visual and textual encoders, allowing us to effectively adapt large V-L models to the downstream task of learning with noisy labels. Our technique exhibits superior performance, particularly in high-noise settings, outperforming state-of-the-art methods in several datasets with synthetic and real label noise. Our contribution comes from a novel, simple, but highly efficient methodological path to learning with noisy labels while remaining straightforward to implement. The code is available at <span><span>https://github.com/bhalajin/AVL-Prompter</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104550"},"PeriodicalIF":3.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144779775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective sparse tracking with convolution-based discriminative sparse appearance model
Qi Xu, Zhuoming Xu, Yan Tang, Yun Chen, Huabin Wang, Liang Tao
Journal of Visual Communication and Image Representation, vol. 111, Article 104547 (2025-08-05). DOI: 10.1016/j.jvcir.2025.104547
Abstract: Existing sparse appearance models, which rely on a sparse linear combination of dictionary atoms, often fall short in leveraging the hierarchical features within the foreground region and the discriminative features that distinguish the foreground from the background. To address these limitations, we propose a novel sparse appearance model called the Convolutional Discriminative Sparse Appearance (CDSA) model. Unlike existing sparse appearance models, the CDSA model is constructed by convolving a set of sparse filters with input images. These filters are designed to highlight the distinctions between foreground and background regions, making the CDSA model discriminative. Additionally, by stacking the convolutional feature maps, the CDSA model captures hierarchical features within the target object. We also propose a robust updating scheme that leverages high-confidence tracking results to mitigate model corruption due to severe occlusion. Extensive experiments on the OTB100 and UAV123@10_fps datasets demonstrate that the proposed CDSA-based sparse tracker outperforms existing sparse trackers and several state-of-the-art tracking methods in terms of tracking accuracy and robustness.
ATrans: Improving single object tracking based on dual attention
Haichao Liu, Jiangwei Qin, Haoyu Liang, Miao Yu, Shijia Lou, Yang Luo
Journal of Visual Communication and Image Representation, vol. 111, Article 104553 (2025-07-31). DOI: 10.1016/j.jvcir.2025.104553
Abstract: Mainstream Siamese-based object tracking methods usually match local regions of two video frames; this regional association ignores the global features of object modeling. To improve the robustness of long-term object tracking and, to a certain extent, its efficiency, we propose a new tracking framework based on a dual attention mechanism, named ATrans. Our core design builds on the flexibility of the attention mechanism: we propose a dual attention module that obtains more precise features and strengthens feature extraction by attending to contextual information. We construct the ATrans framework by stacking multiple encoders with dual attention modules and a decoder, topped by a localization head. To address drift in long-term tracking, we add an online update mechanism to the encoder structure that dynamically refreshes the target template. To further improve efficiency, we propose a background removal module that reduces computation by discarding unnecessary background regions during tracking. Experiments show that our tracker performs well on large datasets such as LaSOT, GOT-10k, and TrackingNet.
{"title":"Reversible data hiding in encrypted 3D mesh models via ripple prediction","authors":"Shuying Xu , Ji-Hwei Horng , Ching-Chun Chang , Chin-Chen Chang","doi":"10.1016/j.jvcir.2025.104551","DOIUrl":"10.1016/j.jvcir.2025.104551","url":null,"abstract":"<div><div>Reversible data hiding in the encrypted domain has attracted increasing attention for securing secret data while protecting the sensitive content of the host media. We propose a reversible data hiding method for encrypted 3D mesh models using ripple prediction, which leverages the inherent continuity of the 3D structure to achieve accurate vertex prediction. The method begins by selecting a specific vertex as a reference. This reference vertex is then used to predict a ring of directly connected vertices, which in turn predict an outer ring of their connected vertices, and so on. This ripple-like prediction process continues until all vertices have been processed. This approach enables precise vertex coordinate prediction using only a single reference vertex. As a result, the 3D mesh data can be significantly compressed, creating spare capacity for secret data embedding. Experimental results demonstrate that our method outperforms state-of-the-art schemes in terms of embedding rate.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104551"},"PeriodicalIF":3.1,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144779740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes
Tiecheng Song, Yi Peng, Chun Liu, Anyong Qin, Yue Zhao, Feng Yang, Chenqiang Gao
Journal of Visual Communication and Image Representation, vol. 111, Article 104545 (2025-07-31). DOI: 10.1016/j.jvcir.2025.104545
Abstract: Multi-person pose estimation (MPPE) in crowded scenes is challenging due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and do not adopt distinct processing strategies for different types of joints. (2) They use only simple joint names as text prompts, failing to mine other informative text hints that describe detailed joint situations. To address these problems, we propose an occlusion-aware MPPE method that explores keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework with a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy that learns global and local context features for two types of joints according to their movement flexibility. In the teacher network, we introduce a vision-language model to represent detailed joint situations and explore dual prompts, i.e., rough body-part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge of the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of KDG over state-of-the-art methods for MPPE in crowded and occluded scenes. The source codes are available at https://github.com/stc-cqupt/KDG.