{"title":"PointMamba++: Rethinking Ordering and Convolution Strategy of State Space Model for Point Cloud Analysis","authors":"Ke Xu, Xinpu Liu, Yinghui Gao, Qingyong Hu, Xinjie Wang, Hanyun Wang, Yulan Guo","doi":"10.1049/cvi2.70064","DOIUrl":"https://doi.org/10.1049/cvi2.70064","url":null,"abstract":"<p>With the linear complexity and long-sequence global modelling capability, Mamba becomes a competitor to Transformer architectures in point clouds analysis. However, the designs of traditional 1D convolutions and reordering strategies, do not match the inherently unordered nature of point clouds, which constrain the performance enhancement. In this work, we rethink the ordering and convolution strategy of the PointMamba, and present a novel architecture named PointMamba++ to more effectively aggregate local structural features and achieve a superior accuracy and computation trade-off. Specifically, we design a point-edge convolution to aggregate neighbourhood features of point cloud tokens, which replaces 1D convolution layers in traditional Mamba modules and does not perform convolution by sequence but according to geometric relationships. Furthermore, considering that forcibly ordering point clouds is not conducive to learning local geometric features and easily leads to unstable sequence dependencies, we design a sequence-independent BiMamba module, which adopts two reverse and parallel scanning paths, to reduce the dependency on sequential scanning of Mamba while enhancing point cloud representation abilities. Extensive experiments show that PointMamba++ surpasses typical convolution-based and Transformer-based architectures, and achieves state-of-the-art performance on multiple tasks including shape classification, part segmentation, and semantic segmentation.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70064","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147696404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Encrypt Anything: A Content-Aware Hierarchical Privacy Protection Method for Image Data","authors":"Jiawei Han, Bingxin Wu, Ying Xu, Peihang Han, Boyan Wang, Yuchen Gu","doi":"10.1049/cvi2.70057","DOIUrl":"https://doi.org/10.1049/cvi2.70057","url":null,"abstract":"<p>With the exponential expansion of image data, data privacy security is facing significant challenges. The privacy protection methods that are currently in use generally suffer from inefficiency and absence of semantic understanding, resulting in either excessive encryption or insufficient protection, making it difficult to comply with privacy protection requirements in complex scenarios. To this end, this paper proposes the Encrypt Anything Model (EAM), which is capable of performing fine-grained hierarchical and region-based encryption on the content within images. EAM constructs the perception unit by integrating vision foundation models, leverages cross-modal feature fusion techniques to accurately identify and segment privacy-related entities in images and applies hierarchical and region-specific encryption to different areas according to the privacy level of each entity. During the decryption phase, EAM introduces a differentiated decryption mechanism based on a permission matrix, which controls the image content that users are allowed to recover through dynamic token allocation, thereby enabling multilevel privacy protection. Experimental results across multiple privacy scenarios validate the superior performance of EAM in terms of detection accuracy and privacy entity coverage. Qualitative analysis further demonstrates that the model can effectively obscure sensitive information while maximising the usability of nonsensitive regions. By constructing a complete pipeline of ‘perceptual recognition, hierarchical encryption and differentiated decryption’, this study achieves fine-grained governance of image privacy, providing a flexible and extensible general framework to meet multilevel privacy protection requirements and image privacy governance needs in open scenarios.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70057","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147567888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RainReID: Person Re-Identification in Rainy Weather and a Large-Scale Dataset","authors":"Zixie Guo, Ke Xu, Qingming Leng","doi":"10.1049/cvi2.70063","DOIUrl":"https://doi.org/10.1049/cvi2.70063","url":null,"abstract":"<p>Person re-identification (Re-ID) is a fundamental problem in computer vision that focuses on matching individuals across nonoverlapping camera views. It plays a crucial role in large-scale intelligent surveillance systems, enabling efficient image retrieval of specific persons of interest. Although state-of-the-art Re-ID methods have achieved remarkable progress in diverse scenarios, complex weather scenarios (e.g., rain) remain an underexplored factor. Rainy day is one of the most common types of weather, and Re-ID in rainy weather scenario is confronted with complex challenges, such as occlusions caused by umbrellas or raindrops, illumination variations during indoor–outdoor transitions and so on. To develop the Re-ID in complex weather scenarios, this work introduce the first person re-identification work under real rainy weather. A large-scale dataset, RainReID, is established after a long and arduous effort due to the difficulty of data acquisition. RainReID contains 596 identities and 27,617 images captured from 7 cameras, and involves both outdoor and indoor conditions. A dual-branch scene-adaptive (DSA) framework is proposed as a deep learning based benchmark, which is used to enhance the robustness especially for indoor–outdoor Re-ID. Extensive experiments demonstrate the challenge of RainReID, and the effectiveness of DAS is proved under rainy weather. RainReID will be released at https://github.com/Qingming-Leng/RainReID.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70063","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147566777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AT-ViT: Area-Targeted Multi-View Vision Transformer With Cross-Attention and Multi-Scale Patching for Plant Trait Recognition in Herbarium Images","authors":"Amani Sedrat, Takieddine Chehhat, Youcef Sklab, Hanane Ariouat, Abderrazak Sebaa, Eric Chenin, Jean-Daniel Zucker, Edi Prifti","doi":"10.1049/cvi2.70059","DOIUrl":"https://doi.org/10.1049/cvi2.70059","url":null,"abstract":"<p>Automated plant traits recognition from herbarium images is essential for plant sciences, yet it remains challenging because background elements (e.g., textual labels, mounting artefacts and colour charts) can introduce shortcut learning, leading models to rely on spurious nonplant cues rather than plant morphology. This bias degrades both generalisation and interpretability. In this paper, we introduce <b>AT-ViT</b>, a dual-branch vision transformer that jointly encodes raw herbarium scans and their segmented-derived counterparts via a multi-scale, multi-view cross-attention fusion scheme. AT-ViT further incorporates a mask-guided patch weighting mechanism that amplifies plant-relevant regions and attenuates background-driven features. By learning from the original scans while being guided by segmentation masks through the mask-guided patch reweighting mechanism, the model is encouraged to focus on plant organs and learn plant-centric representations more effectively. Across multiple trait classification tasks (e.g., leaf base shape, thorns), AT-ViT delivers consistent accuracy gains, improves attention localisation on plant regions and exhibits increased robustness under synthetic background perturbations. Specifically, AT-ViT substantially improves spatial attention grounding, boosting plant-region alignment (Avg IoU_p: +15.66 to +18.03 pp) while reducing background overlap (Avg IoU_b: −27.92 to −31.02 pp) relative to CrossViT, and remains markedly more robust to background perturbations, outperforming ResNet101 by up to +32.32 accuracy points and CrossViT by up to +5.07 points under background-noise conditions.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70059","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147653379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MMCATrack: Multi-Modal Channel Attention Tracker","authors":"Zhiqiang Zhao, Daitu Wen, Yuanhang Gu, Xiaoli Luo, Tao Ma, Xu Ma, Bin Wu","doi":"10.1049/cvi2.70060","DOIUrl":"https://doi.org/10.1049/cvi2.70060","url":null,"abstract":"<p>Most existing Transformer-based visual object tracking methods rely exclusively on the feature map from the last encoder layer for object prediction, thereby overlooking the rich information contained in shallow and intermediate layer feature maps. This limitation reduces the representational capacity of the model. Moreover, current multi-modal tracking frameworks typically construct multi-modal features through simple concatenation, which fails to adequately account for the differential contributions of individual modalities to the final prediction task. As a result, these approaches exhibit an insufficient ability to express key features within the multi-modal representation. To address the aforementioned issues, this paper proposes a multi-modal channel attention tracking algorithm, where a multi-modal channel attention block is incorporated for the purpose of enhancing the representation ability of the key features within the multi-modal features. Specifically, the multi-modal channel attention block first aggregates multi-modal information from the multi-layer feature maps of the encoder through cross layer cascading and then applies channel attention mechanism to dynamically calibrate the channel weights in the generated multi-modal features, thereby enhancing the representation of key features. In addition, this article proposes a new regression loss function to improve localisation accuracy. Finally, abundant experiments conducted on five benchmarks including GOT-10K, TrackingNet, TNL2K, VisEvent and RGBT234 have verified the effectiveness of our theory.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70060","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147653167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Grained Vision–Language Alignment for Domain Generalised Person Re-Identification","authors":"Jiachen Li, Xiaojin Gong, Dongping Zhang","doi":"10.1049/cvi2.70062","DOIUrl":"https://doi.org/10.1049/cvi2.70062","url":null,"abstract":"<p>Domain generalised person re-identification (DG Re-ID) is a challenging task, where models are trained on source domains but tested on unseen target domains. Although previous pure vision-based models have achieved significant progress, the performance can be further improved. Recently, vision-language models (VLMs) present outstanding generalisation capabilities in various visual applications. However, directly adapting a VLM to Re-ID shows limited generalisation improvement. This is because the VLM only produces global features that are insensitive to ID nuances. To tackle this problem, we propose a CLIP-based multi-grained vision–language alignment framework in this work. Specifically, several multi-grained prompts are introduced in language modality to describe different body parts and align with their counterparts in vision modality. To obtain fine-grained visual information, an adaptively masked multi-head self-attention module is employed to precisely extract specific part features. To train the proposed module, an MLLM-based visual grounding expert is employed to automatically generate pseudo labels of body parts for supervision. Extensive experiments conducted on both single- and multi-source generalisation protocols demonstrate the superior performance of our approach. The implementation code will be released at https://github.com/RikoLi/MUVA.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70062","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cephalometric Landmark Detection Using a Multi-Scale Cross-Attention Model","authors":"Shuli Xing, Hao Liang, Guojun Mao, Yihang Zhou, Jinghui Ling, Haiyan Wang","doi":"10.1049/cvi2.70056","DOIUrl":"https://doi.org/10.1049/cvi2.70056","url":null,"abstract":"<p>The annotation of cephalometric landmarks plays a critical role in craniofacial diagnosis and treatment. Compared to conventional manual methods, deep learning-based automated approaches significantly reduce both time requirements and labour costs. Most current deep learning models primarily rely on convolutional neural networks, but these models exhibit limitations in capturing long-range dependencies between pixels. Transformer-based models can effectively address this issue; however, they exhibit poor inductive bias when applied to small-scale image datasets and are computationally expensive to train on high-resolution images. In this paper, we propose a novel feature extraction module that reduces the quadratic complexity of computing global attention while enhancing the diversity of global features. Moreover, we extend the range of the original heatmap values and generate multiple outputs for each landmark position prediction. We integrate these components into a simple U-shaped model, and it achieves competitive detection accuracy without using any pretrained or additional processes compared to several recent methods. In addition, our experiments reveal that the Gaussian kernel size is a critical factor affecting model performance, a parameter that has not been extensively explored in the existing literature.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70056","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147649407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Reliability of Likelihoods From Conditional Flow Matching Generative Models Trained in Feature Space","authors":"Shane Josias, Willie Brink","doi":"10.1049/cvi2.70061","DOIUrl":"https://doi.org/10.1049/cvi2.70061","url":null,"abstract":"<p>Normalising flows are a flexible class of generative models that provide exact likelihoods and are often trained through maximum likelihood estimation. Recent work suggests that discrete-step flow models trained in this way can assign undesirably high likelihood to out-of-distribution image data, bringing their reliability for applications where likelihoods are important (e.g., outlier detection) into question. Continuous-time normalising flows trained with the conditional flow matching objective (CFM models) also provide unreliable likelihoods, and we investigate whether training them on various feature representations can lead to more reliable likelihoods. We consider features from a pretrained classifier, features from a pretrained perceptual autoencoder and features from an autoencoder trained from scratch with a simple pixel-based reconstruction loss, and compare their effects on CFM model likelihoods on various in- and out-of-distribution sets. Autoencoder-based features are of particular interest as the presence of a decoder preserves the ability to generate images. We find that training CFM models on feature representations can lead to improvements in likelihood reliability, but only for certain datasets and certain parameterisations of the feature space, at a cost in sample quality. Further investigation suggests possible links between likelihood reliability and geometric characteristics of the data and the feature space, and opens avenues for future work.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70061","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147320886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Double-Layer Graph Attention Networks for Parathyroid Detection","authors":"Wanling Liu, Wenhuan Lu, Qian Sun, Fei Chen, Jianguo Wei, Bo Wang, Wenxin Zhao","doi":"10.1049/cvi2.70058","DOIUrl":"https://doi.org/10.1049/cvi2.70058","url":null,"abstract":"<p>Due to the importance of the parathyroid glands (PG) for health, detecting and preserving them during endoscopic thyroid surgery is vital. However, existing parathyroid detection methods face issues from colour variations, target deformation, blurriness, and lighting effects in complex surgical environments. Therefore, they fail to extract high-quality features and perform poorly in our detection tasks. The essential reasons for these issues are two-fold: (A) ignoring the spatial relation among targets and (B) failing to identify the shapes, colours, and positions of targets, mainly due to insufficient utilisation of <i>depth information</i>, especially under lighting variations and occlusions, which are unavoidably prone to false or missed detection. To better discover and exploit the inter-target spatial relation (solving A), inspired by the power of graph neural networks on dependency modelling, we explore an effective graph-based framework specifically designed for PG detection. Specifically, we propose a novel double-layer graph attention network (i.e., <i>DL-GAT</i> for short), which explicitly facilitates <i>local augmentation</i> through the identification of key visual features (e.g., texture and shape) and global interactions. Besides, it also has merits in robustly combating image blur and better differentiating PG targets and background parts, thus improving detection precision. On the other hand, to solve B and better incorporate <i>depth knowledge</i>, we further propose a <i>depth augmentation</i> component, which can adaptively capture the intrinsic geometrical features of targets based on depth information and thus significantly improve light intensity robustness and naturally enhance generalisability. Moreover, because we lack a thyroid endoscopy surgery benchmark to evaluate and compare the performance of models for this task, we meticulously established a novel data set from 838 real surgeries performed (via the fully laparoscopic thoracic-breast approach) at the Fujian Medical University Union Hospital. Extensive experiments show that our framework achieves superior PG detection accuracy compared to its current state-of-the-art counterparts while maintaining real-time efficiency.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70058","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146224078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Layer Convolutional Sparse Network for Pattern Classification Based on Sequential Dictionary Learning","authors":"Farhad Sadeghi Almalou, Farbod Razzazi, Arash Amini","doi":"10.1049/cvi2.70055","DOIUrl":"https://doi.org/10.1049/cvi2.70055","url":null,"abstract":"<p>Convolutional sparse coding (CSC) using learnt convolutional dictionaries has recently emerged as an effective technique for emphasising discriminative structures in signal and image processing applications. In this paper, we propose a multilayer model for convolutional sparse networks (CSNs), based on hierarchical convolutional sparse coding and dictionary learning, as a competitive alternative to conventional deep convolutional neural networks (CNNs). In the proposed CSN architecture, each layer learns a convolutional dictionary from the feature maps of the preceding layer (if available), and then uses it to extract sparse representations. This hierarchical process is repeated to obtain high-level feature maps in the final layer, suitable for pattern recognition and classification tasks. One key advantage of the CSN framework is its reduced sensitivity to training set size and its significantly lower computational complexity compared to CNNs. Experimental results on image classification tasks show that the proposed model achieves up to 7% higher accuracy than CNNs when trained with only 150 samples, while reducing computational cost by at least 50% under similar conditions.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.70055","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}