{"title":"MVDT: Multiview Distillation Transformer for View-Invariant Sign Language Translation","authors":"Zhong Guan, Yongli Hu, Huajie Jiang, Yanfeng Sun, Baocai Yin","doi":"10.1049/cvi2.70038","DOIUrl":"https://doi.org/10.1049/cvi2.70038","url":null,"abstract":"<p>Sign language translation based on machine learning plays a crucial role in facilitating communication between deaf and hearing individuals. However, due to the complexity and variability of sign language, coupled with limited observation angles, single-view sign language translation models often underperform in real-world applications. Although some studies have attempted to improve translation efficiency by incorporating multiview data, challenges, such as feature alignment, fusion, and the high cost of capturing multiview data, remain significant barriers in many practical scenarios. To address these issues, we propose a multiview distillation transformer model (MVDT) for continuous sign language translation. The MVDT introduces a novel distillation mechanism, where a teacher model is designed to learn common features from multiview data, subsequently guiding a student model to extract view-invariant features using only single-view input. To evaluate the proposed method, we construct a multiview sign language dataset comprising five distinct views and conduct extensive experiments comparing the MVDT with state-of-the-art methods. Experimental results demonstrate that the proposed model exhibits superior view-invariant translation capabilities across different views.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70038","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Multiscale Attention Feature Aggregation for Multi-Modal 3D Occluded Object Detection","authors":"Yanfeng Han, Ming Yu, Jing Liu","doi":"10.1049/cvi2.70035","DOIUrl":"https://doi.org/10.1049/cvi2.70035","url":null,"abstract":"<p>Accurate perception and understanding of the three-dimensional environment is crucial for autonomous vehicles to navigate efficiently and make wise decisions. However, in complex real-world scenarios, the information obtained by a single-modal sensor is often incomplete, severely affecting the detection accuracy of occluded targets. To address this issue, this paper proposes a novel adaptive multi-scale attention aggregation strategy, efficiently fusing multi-scale feature representations of heterogeneous data to accurately capture the shape details and spatial relationships of targets in three-dimensional space. This strategy utilises learnable sparse keypoints to dynamically align heterogeneous features in a data-driven manner, adaptively modelling the cross-modal mapping relationships between keypoints and their corresponding multi-scale image features. Given the importance of accurately obtaining the three-dimensional shape information of targets for understanding the size and rotation pose of occluded targets, this paper adopts a shape prior knowledge-based constraint method and data augmentation strategy to guide the model to more accurately perceive the complete three-dimensional shape and rotation pose of occluded targets. Experimental results show that our proposed model achieves 2.15%, 3.24% and 2.75% improvement in 3D<sub>R40</sub> mAP score under the easy, moderate and hard difficulty levels compared to MVXNet, significantly enhancing the detection accuracy and robustness of occluded targets in complex scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70035","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144647572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds From RGB Images for 2D Classification","authors":"Youcef Sklab, Hanane Ariouat, Eric Chenin, Edi Prifti, Jean-Daniel Zucker","doi":"10.1049/cvi2.70036","DOIUrl":"https://doi.org/10.1049/cvi2.70036","url":null,"abstract":"<p>We introduce the shape-image multimodal network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution lies in a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well-suited for the classification of digitised herbarium specimens—a task made challenging by heterogeneous backgrounds, nonplant elements, and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70036","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved SAR Aircraft Detection Algorithm Based on Visual State Space Models","authors":"Yaqiong Wang, Jing Zhang, Yipei Wang, Shiyu Hu, Baoguo Shen, Zhenhua Hou, Wanting Zhou","doi":"10.1049/cvi2.70032","DOIUrl":"https://doi.org/10.1049/cvi2.70032","url":null,"abstract":"<p>In recent years, the development of deep learning algorithms has significantly advanced the application of synthetic aperture radar (SAR) aircraft detection in remote sensing and military fields. However, existing methods face a dual dilemma: CNN-based models suffer from insufficient detection accuracy due to limitations in local receptive fields, whereas Transformer-based models improve accuracy by leveraging attention mechanisms but incur significant computational overhead due to their quadratic complexity. This imbalance between accuracy and efficiency severely limits the development of SAR aircraft detection. To address this problem, this paper propose a novel neural network based on state space models (SSM), termed the Mamba SAR detection network (MSAD). Specifically, we design a feature encoding module, MEBlock, that integrates CNN with SSM to enhance global feature modelling capabilities. Meanwhile, the linear computational complexity brought by SSM is superior to that of Transformer architectures, achieving a reduction in computational overhead. Additionally, we propose a context-aware feature fusion module (CAFF) that combines attention mechanisms to achieve adaptive fusion of multi-scale features. Lastly, a lightweight parameter-shared detection head (PSHead) is utilised to effectively reduce redundant parameters through implicit feature interaction. Experiments on the SAR-AirCraft-v1.0 and SADD datasets show that MSAD achieves higher accuracy than existing algorithms, whereas its GFLOPs are 2.7 times smaller than those of the Transformer architecture RT-DETR. These results validate the core role of SSM as an accuracy-efficiency balancer, reflecting MSAD's perceptual capability and performance in SAR aircraft detection in complex environments.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70032","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144574132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Attention Fusion Artistic Radiance Fields and Beyond","authors":"Qianru Chen, Yufan Zhou, Xintong Hou, Kunze Jiang, Jincheng Li, Chao Wu","doi":"10.1049/cvi2.70017","DOIUrl":"https://doi.org/10.1049/cvi2.70017","url":null,"abstract":"<p>We present MRF (multi-attention fusion artistic radiance fields), a novel approach to 3D scene stylisation that synthesises artistic rendering by integrating stylised 2D images with neural radiance fields. Our method effectively incorporates high-frequency stylistic elements from 2D artistic representations while maintaining geometric consistency across multiple viewpoints. To address the challenges of view-dependent stylisation coherence and semantic fidelity, we introduce two key components: (1) a multi-scale attention module (MAM) that facilitates hierarchical feature extraction and fusion across different spatial resolutions and (2) a CLIP-guided semantic consistency module that preserves the underlying scene structure during style transfer. Through extensive experimentation, we demonstrate that MRF achieves superior stylisation quality and detail preservation compared to state-of-the-art methods, particularly in capturing fine artistic details while maintaining view consistency. Our approach represents a significant advancement in neural rendering-based artistic stylisation of 3D scenes.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70017","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144331868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PAD: Detail-Preserving Point Cloud Reconstruction and Generation via Autodecoders","authors":"Yakai Zhang, Ping Yang, Haoran Wang, Zizhao Wu, Xiaoling Gu, Alexandru Telea, Kosinka Jiri","doi":"10.1049/cvi2.70031","DOIUrl":"https://doi.org/10.1049/cvi2.70031","url":null,"abstract":"<p>High-accuracy point cloud (self-) reconstruction is crucial for point cloud editing, translation, and unsupervised representation learning. However, existing point cloud reconstruction methods often sacrifice many geometric details. Altough many techniques have proposed how to construct better point cloud decoders, only a few have designed point cloud encoders from a reconstruction perspective. We propose an autodecoder architecture to achieve detail-preserving point cloud reconstruction while bypassing the performance bottleneck of the encoder. Our architecture is theoretically applicable to any existing point cloud decoder. For training, both the weights of the decoder and the pre-initialised latent codes, corresponding to the input points, are updated simultaneously. Experimental results demonstrate that our autodecoder achieves an average reduction of 24.62% in Chamfer Distance compared to existing methods, significantly improving reconstruction quality on the ShapeNet dataset. Furthermore, we verify the effectiveness of our autodecoder in point cloud generation, upsampling, and unsupervised representation learning to demonstrate its performance on downstream tasks, which is comparable to the state-of-the-art methods. We will make our code publicly available after peer review.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70031","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144315358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GRVT: Improving the Transferability of Adversarial Attacks Through Gradient Related Variance and Input Transformation","authors":"Yanlei Wei, Xiaolin Zhang, Yongping Wang, Jingyu Wang, Lixin Liu","doi":"10.1049/cvi2.70034","DOIUrl":"https://doi.org/10.1049/cvi2.70034","url":null,"abstract":"<p>As we all know, the emergence of a large number of adversarial samples reveals the vulnerability of deep neural networks. Attackers seriously affect the performance of models by adding imperceptible perturbations. Although adversarial samples have a high transferability success rate in white-box models, they are less effective in black-box models. To address this problem, this paper proposes a new transferability attack strategy, Gradient Related Variance and Input Transformation Attack (GRVT). First, the image is divided into small blocks, and random transformations are applied to each block to generate diversified images; then, in the gradient update process, the gradient of the neighbourhood area is introduced, and the current gradient is associated with the neighbourhood average gradient through Cosine Similarity. The current gradient direction is adjusted using the associated gradient combined with the previous gradient variance, and a step size reducer adjusts the gradient step size. Experiments on the ILSVRC 2012 dataset show that the transferability success rate of adversarial samples between convolutional neural network (CNN) and vision transformer (ViT) models is higher than that of currently advanced methods. Additionally, the adversarial samples generated on the ensemble model are practical against nine defence strategies. GRVT shows excellent transferability and broad applicability.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70034","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Foreground–Background Discrimination for Weakly Supervised Semantic Segmentation","authors":"Zhoufeng Liu, Bingrui Li, Miao Yu, Guangshuai Gao, Chunlei Li","doi":"10.1049/cvi2.70029","DOIUrl":"https://doi.org/10.1049/cvi2.70029","url":null,"abstract":"<p>Weakly supervised semantic segmentation (WSSS) methods are extensively studied due to the availability of image-level annotations. Relying on class activation maps (CAMs) derived from original classification networks often suffers from issues such as inaccurate object localization, incomplete object regions, and the inclusion of confusing background pixels. To address these issues, we propose a two-stage method that enhances the foreground–background discriminative ability in a global context (FB-DGC). Specifically, a cross-domain feature calibration module (CFCM) is first proposed to calibrate foreground and background salient features using global spatial location information, thereby expanding foreground features while mitigating the impact of inaccurate localization in class activation regions. A class-specific distance module (CSDM) is further adopted to facilitate the separation of foreground–background features, thereby enhancing the activation of target regions, which alleviates the over-smoothing of features produced by the network and mitigates issues associated with confused features. In addition, an adaptive edge feature extraction (AEFE) strategy is proposed to identify target features in candidate boundary regions and capture missed features, compensating for drawbacks in recognising the co-occurrence of multiple targets. The proposed method is extensively evaluated on the challenging PASCAL VOC 2012 and MS COCO 2014 datasets, demonstrating its feasibility and superiority.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70029","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144244550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mamba4SOD: RGB-T Salient Object Detection Using Mamba-Based Fusion Module","authors":"Yi Xu, Ruichao Hou, Ziheng Qi, Tongwei Ren","doi":"10.1049/cvi2.70033","DOIUrl":"https://doi.org/10.1049/cvi2.70033","url":null,"abstract":"<p>RGB and thermal salient object detection (RGB-T SOD) aims to accurately locate and segment salient objects in aligned visible and thermal image pairs. However, existing methods often struggle to produce complete masks and sharp boundaries in challenging scenarios due to insufficient exploration of complementary features from the dual modalities. In this paper, we propose a novel mamba-based fusion network for RGB-T SOD task, named Mamba4SOD, which integrates the strengths of Swin Transformer and Mamba to construct robust multi-modal representations, effectively reducing pixel misclassification. Specifically, we leverage Swin Transformer V2 to establish long-range contextual dependencies and thoroughly analyse the impact of features at various levels on detection performance. Additionally, we develop a novel Mamba-based fusion module with linear complexity, boosting multi-modal enhancement and fusion. Experimental results on VT5000, VT1000 and VT821 datasets demonstrate that our method outperforms the state-of-the-art RGB-T SOD methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144220255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Object Detection Based on CNN and Vision-Transformer: A Survey","authors":"Jinfeng Cao, Bo Peng, Mingzhong Gao, Haichun Hao, Xinfang Li, Hongwei Mou","doi":"10.1049/cvi2.70028","DOIUrl":"https://doi.org/10.1049/cvi2.70028","url":null,"abstract":"<p>Object detection is the most crucial and challenging task of computer vision and has been used in various fields in recent years, such as autonomous driving and industrial inspection. Traditional object detection methods are mainly based on the sliding windows and the handcrafted features, which have problems such as insufficient understanding of image features and low accuracy of detection. With the rapid advancements in deep learning, convolutional neural networks (CNNs) and vision transformers have become fundamental components in object detection models. These components are capable of learning more advanced and deeper image properties, leading to a transformational breakthrough in the performance of object detection. In this review, we comprehensively review the representative object detection models from deep learning periods, tracing their architectural shifts and technological breakthroughs. Furthermore, we discuss key challenges and promising research directions in the object detection. This review aims to provide a comprehensive foundation for practitioners to enhance their understanding of object detection technologies.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70028","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144179248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}