MP-YOLO: multidimensional feature fusion based layer adaptive pruning YOLO for dense vehicle object detection algorithm
Wanzhen Zhou, Junjie Wang, Xi Meng, Jianxia Wang, Yufei Song, Zhiguo Liu
Journal of Visual Communication and Image Representation, vol. 112, Article 104560 (2025-08-18). DOI: 10.1016/j.jvcir.2025.104560
Abstract: In recent years, artificial intelligence has been applied to the research and development of autonomous vehicles. However, the high energy consumption of AI models and the high accuracy required for object detection in autonomous driving have held back that development. To alleviate these problems, we optimize YOLOv8 and propose a lightweight vehicle object detection algorithm, MP-YOLO (Multidimensional feature fusion and layer adaptive Pruning YOLO), which suits edge devices with limited storage while meeting detection accuracy requirements. First, two multi-scale feature fusion modules, MSFB and HFF, are proposed to merge features of different dimensions and enhance the model's feature learning capability. Second, a detection head at the 160×160 scale is added to improve small-object detection. Third, the WIoU loss replaces the original CIoU loss in YOLOv8 to address the high overlap among road objects. Lastly, the Layer-Adaptive Sparsity for Magnitude-based Pruning (LAMP) method is applied to significantly reduce model size. MP-YOLO was evaluated on the recent autonomous driving dataset DAIR-V2X; it outperforms the original model with improvements of 4.7% in AP50 and 4.2% in AP, while the model size shrinks from 6 MB to 2.2 MB. It surpasses other classical detection models in both size and accuracy and meets the requirements for deployment on edge devices. The source code is available at https://github.com/Wang-jj-zs/MP-YOLO/tree/master.
{"title":"Res2former: A multi-scale fusion based transformer feature extraction method","authors":"Bojun Xie, Yanjie Wang, Shaocong Guo, Junfen Chen","doi":"10.1016/j.jvcir.2025.104546","DOIUrl":"10.1016/j.jvcir.2025.104546","url":null,"abstract":"<div><div>In this paper, we propose Res2former, a novel lightweight hybrid architecture that combines convolutional neural networks (CNNs) and Transformers to effectively model both local and global dependencies in visual data. While Vision Transformer (ViT) demonstrates strong global modeling capability, it lack locality and translation-invariance, making it reliant on large-scale datasets and computational resources. To address this, Res2former adopts a stage-wise hybrid design: in shallow layers, CNNs replace Transformer blocks to exploit local inductive biases and reduce early computational cost; in deeper layers, we introduce a multi-scale fusion mechanism by embedding multiple parallel convolutional kernels of varying receptive fields into the Transformer’s MLP structure. This enables Res2former to capture multi-scale visual semantics more effectively and fuse features across different scales. Experimental results reveal that with the same parameters and computational complexity, Res2former outperforms variants of Transformer and CNN models in image classification (80.7 top-1 accuracy on ImageNet-1K), object detection (45.8 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>b</mi><mi>o</mi><mi>x</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set), and instance segmentation (41.0 <span><math><mrow><mi>A</mi><msup><mrow><mi>P</mi></mrow><mrow><mi>m</mi><mi>a</mi><mi>s</mi><mi>k</mi></mrow></msup></mrow></math></span> on the COCO 2017 Validation Set) tasks. The code is publicly accessible at <span><span>https://github.com/hand-Max/Res2former</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104546"},"PeriodicalIF":3.1,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-dimensional normalized knowledge distillation leveraging class relations","authors":"Benhong Zhang, Yiren Song, Yidong Zhang, Xiang Bi","doi":"10.1016/j.jvcir.2025.104557","DOIUrl":"10.1016/j.jvcir.2025.104557","url":null,"abstract":"<div><div>Knowledge distillation (KD) as one of the important methods of model compression, has been widely used in tasks such as image classification and detection. Existing KD methods are mainly carried out at the instance level and often ignore the role of inter-class relational information. Additionally, when there is a significant gap between the student’s capacity and the teacher’s capacity, the two model cannot be matched precisely. To address these issues, this paper proposes a two-dimensional normalized knowledge distillation method, which integrates inter-class and intra-class correlations and rectifies logits in two dimensions. Through our approach, the student model is able to acquire contextual information between samples with the help of intra-class correlation and mitigate the effect of logits magnitude on the prediction results through normalized rectification. We conduct numerous experiments and results show that our method achieves higher accuracy and better training efficiency compared to traditional KD methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"112 ","pages":"Article 104557"},"PeriodicalIF":3.1,"publicationDate":"2025-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144911532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salient object detection enhanced pseudo-labels for weakly supervised semantic segmentation
Yunping Zheng, Zhou Jiang, Shiqiang Shu, Yuze Zhu, Zejun Wang, Mudar Sarem
Journal of Visual Communication and Image Representation, vol. 111, Article 104548 (2025-08-14). DOI: 10.1016/j.jvcir.2025.104548
Abstract: To address the limitations of pseudo-labels generated from Class Activation Maps (CAM) in weakly supervised semantic segmentation, we propose a novel salient object fusion framework. The framework complements CAM localization by capturing the complete contours and edge details of salient targets through our proposed RGB-SOD network. We also design a saliency object selector that dynamically balances the weights of CAM and Salient Object Detection (SOD) when generating single-class pseudo-labels, further improving their quality. Despite its simplicity, our method achieves competitive performance of 77.52% and 77.73% on the PASCAL VOC 2012 validation and test sets, respectively, pushing past the performance bottleneck of state-of-the-art methods. This work highlights the importance of effectively integrating complementary information for weakly supervised segmentation. Our source code is publicly available at https://github.com/UGVly/SOD-For-WSSS.git.
{"title":"FocusTrack: Enhancing object detection and tracking for small and ambiguous objects","authors":"Said Baz Jahfar Khan , Chuanyue Li , Peng Zhang","doi":"10.1016/j.jvcir.2025.104549","DOIUrl":"10.1016/j.jvcir.2025.104549","url":null,"abstract":"<div><div>Multi-object tracking (MOT) is an essential task in computer vision, but it still faces significant challenges in real-world applications, especially with small, ambiguous, and occluded objects in crowded environments. The research study introduces FocusTrack, an innovative and robust one-stage multi-object tracking system to improve object detection and trajectory association in challenging conditions. FocusTrack initiates by fine-tuning YOLOv10, a modern high-performance detector, across many datasets (MOT17, MOT20, CityPersons, ETHZ, and CrowdHuman). We use copy-paste augmentation on essential training datasets to improve the detection of small and distant objects, therefore significantly improving performance in intricate visual environments.</div><div>To ensure precise and consistent tracking, FocusTrack introduces several vital modules: Modified Soft Buffered IoU (MS-BIoU) for adaptive IoU matching dependent on object sizes and detection confidence; Adaptive Similarity Enhancement (ASE) for the improvement of similarity matrices through occlusion-aware, motion-scaled, and size-weighted adjustments; and Spatial-Temporal Confidence Enhancement (STCE) to dynamically improve detection confidence using spatial overlap, motion patterns, and crowd density. Furthermore, our Track Recovery and Association Refinement (TRAR) module recovers missing objects via adaptive re-association techniques, while SV-Link ensures motion-aware, occlusion-resistant associations, and SOTS improves trajectories using Gaussian Process Regression specific for object dimensions and occlusion intensity.</div><div>After evaluation using the challenging MOT17 and MOT20 benchmarks, FocusTrack achieves HOTA scores of 66.91 and 66.5, MOTA scores of 82.32 and 77.9, and IDF1 scores of 82.96 and 82.1, respectively—exceeding other leading online trackers such as BoostTrack++ and BoT-SORT. The results confirm FocusTrack as a very efficient, real-time MOT framework, especially successful at handling complex and crowded environments with small or partially hidden objects.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104549"},"PeriodicalIF":3.1,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144828154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Vision-Language Prompt Learners for Learning with Noisy Labels","authors":"Changhui Hu , Bhalaji Nagarajan , Ricardo Marques , Petia Radeva","doi":"10.1016/j.jvcir.2025.104550","DOIUrl":"10.1016/j.jvcir.2025.104550","url":null,"abstract":"<div><div>Training deep learning models requires manual labelling of a large volume of diverse data that is a tedious and time-consuming process. As humans are prone to errors, large-scale data labelling often introduces label noise, leading to degradation in the performance of deep neural networks. Recently, pre-trained models on extensive multi-modal data have shown remarkable performance in computer vision tasks. However, their use to tackle the problem of learning with noisy labels is still in its infancy, due to high computational complexity and training costs. In this work, we propose a novel approach, AVL-Prompter, to effectively leverage vision-language-pre-trained models for learning with noisy labels. The key idea of our method is the use of shared deep learnable prompts between visual and textual encoders, allowing us to effectively adapt large V-L models to the downstream task of learning with noisy labels. Our technique exhibits superior performance, particularly in high-noise settings, outperforming state-of-the-art methods in several datasets with synthetic and real label noise. Our contribution comes from a novel, simple, but highly efficient methodological path to learning with noisy labels while remaining straightforward to implement. The code is available at <span><span>https://github.com/bhalajin/AVL-Prompter</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104550"},"PeriodicalIF":3.1,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144779775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective sparse tracking with convolution-based discriminative sparse appearance model
Qi Xu, Zhuoming Xu, Yan Tang, Yun Chen, Huabin Wang, Liang Tao
Journal of Visual Communication and Image Representation, vol. 111, Article 104547 (2025-08-05). DOI: 10.1016/j.jvcir.2025.104547
Abstract: Existing sparse appearance models, which rely on a sparse linear combination of dictionary atoms, often fall short in leveraging the hierarchical features within the foreground region and the discriminative features that distinguish the foreground from the background. To address these limitations, we propose a novel sparse appearance model called the Convolutional Discriminative Sparse Appearance (CDSA) model. Unlike existing sparse appearance models, the CDSA model is constructed by convolving a set of sparse filters with input images. These filters are designed to highlight the distinctions between foreground and background regions, making the CDSA model discriminative. Additionally, by stacking the convolutional feature maps, the CDSA model captures hierarchical features within the target object. We also propose a robust updating scheme that leverages high-confidence tracking results to mitigate model corruption due to severe occlusion. Extensive experiments on the OTB100 and UAV123@10_fps datasets demonstrate that the proposed CDSA-based sparse tracker outperforms existing sparse trackers and several state-of-the-art tracking methods in terms of tracking accuracy and robustness.
ATrans: Improving single object tracking based on dual attention
Haichao Liu, Jiangwei Qin, Haoyu Liang, Miao Yu, Shijia Lou, Yang Luo
Journal of Visual Communication and Image Representation, vol. 111, Article 104553 (2025-07-31). DOI: 10.1016/j.jvcir.2025.104553
Abstract: Mainstream Siamese-based object tracking methods usually match local regions of two video frames; this regional association ignores the global features of object modeling. To improve the robustness of long-term object tracking and, to a certain extent, its efficiency, we propose a new tracking framework based on a dual attention mechanism, named ATrans. Our core design builds on the flexibility of the attention mechanism: we propose a dual attention module that obtains more precise features and strengthens feature extraction by attending to contextual information. We construct the ATrans framework by stacking multiple encoders with dual attention modules and a decoder, topped by a localization head. To address drift in long-term tracking, we add an online update mechanism to the encoder structure that dynamically refreshes the target template. To further improve efficiency, we propose a background removal module that reduces computation by discarding unnecessary background regions during tracking. Experiments show that our tracker performs well on large datasets such as LaSOT, GOT-10k, and TrackingNet.
{"title":"Reversible data hiding in encrypted 3D mesh models via ripple prediction","authors":"Shuying Xu , Ji-Hwei Horng , Ching-Chun Chang , Chin-Chen Chang","doi":"10.1016/j.jvcir.2025.104551","DOIUrl":"10.1016/j.jvcir.2025.104551","url":null,"abstract":"<div><div>Reversible data hiding in the encrypted domain has attracted increasing attention for securing secret data while protecting the sensitive content of the host media. We propose a reversible data hiding method for encrypted 3D mesh models using ripple prediction, which leverages the inherent continuity of the 3D structure to achieve accurate vertex prediction. The method begins by selecting a specific vertex as a reference. This reference vertex is then used to predict a ring of directly connected vertices, which in turn predict an outer ring of their connected vertices, and so on. This ripple-like prediction process continues until all vertices have been processed. This approach enables precise vertex coordinate prediction using only a single reference vertex. As a result, the 3D mesh data can be significantly compressed, creating spare capacity for secret data embedding. Experimental results demonstrate that our method outperforms state-of-the-art schemes in terms of embedding rate.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104551"},"PeriodicalIF":3.1,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144779740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes
Tiecheng Song, Yi Peng, Chun Liu, Anyong Qin, Yue Zhao, Feng Yang, Chenqiang Gao
Journal of Visual Communication and Image Representation, vol. 111, Article 104545 (2025-07-31). DOI: 10.1016/j.jvcir.2025.104545
Abstract: Multi-person pose estimation (MPPE) in crowded scenes is challenging due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and do not adopt distinct processing strategies for different types of joints. (2) They use only simple joint names as text prompts, failing to mine other informative text hints that describe detailed joint situations. To address these problems, we propose an occlusion-aware MPPE method that explores keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework with a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy that learns global and local context features for two types of joints according to their movement flexibility. In the teacher network, we introduce a vision-language model to represent detailed joint situations and explore dual prompts, i.e., rough body-part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge of the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of KDG over state-of-the-art methods for MPPE in crowded and occluded scenes. The source codes are available at https://github.com/stc-cqupt/KDG.