IET Computer Vision: Latest Articles

ShipsMOT: A Comprehensive Benchmark and Framework for Multiobject Tracking of Ships
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-10-02 | DOI: 10.1049/cvi2.70042
Fang Luo, Pengju Jiang, George To Sum Ho, Wenjing Zeng
Abstract: Multiobject tracking of ships is crucial for various applications, such as maritime security and the development of ship autopilot systems. However, existing ship visual datasets primarily focus on ship detection tasks, lacking a fully open-source dataset for multiobject tracking research. Furthermore, current methods often struggle to extract appearance features under complex sea conditions, varying scales and different ship types, which degrades tracking precision. To address these issues, we propose ShipsMOT, a new benchmark dataset containing 121 video sequences averaging 15.45 s per sequence, covering 15 distinct ship types with a total of 237,999 annotated bounding boxes. Additionally, we propose JDR-CSTrack, a ship multiobject tracking framework that improves feature extraction at different scales by optimising a joint detection and Re-ID network. JDR-CSTrack fuses appearance and motion features for multilevel data association, thereby minimising track loss and ID switches. Experimental results confirm that ShipsMOT can serve as a benchmark for future research in ship multiobject tracking and validate the superiority of the proposed JDR-CSTrack framework. The dataset and code are available at https://github.com/jpj0916/ShipsMOT.
Citations: 0
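The association step described above (fusing appearance and motion cues at multiple levels) can be illustrated with a generic tracking-by-detection sketch. The snippet below is not the authors' JDR-CSTrack code; it only shows, under an assumed box format and an assumed 0.5/0.5 weighting, how cosine appearance distance and IoU-based motion distance are combined into a single cost matrix and solved with the Hungarian algorithm.

```python
# Illustrative sketch (not JDR-CSTrack): fuse appearance and motion cues
# into one association cost matrix and solve it with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, track_feats, det_boxes, det_feats,
              w_app=0.5, w_mot=0.5, max_cost=0.8):
    """Match existing tracks to new detections; returns (track_idx, det_idx) pairs."""
    n_t, n_d = len(track_boxes), len(det_boxes)
    cost = np.zeros((n_t, n_d))
    for i in range(n_t):
        for j in range(n_d):
            app = 1.0 - np.dot(track_feats[i], det_feats[j]) / (
                np.linalg.norm(track_feats[i]) * np.linalg.norm(det_feats[j]) + 1e-9)
            mot = 1.0 - iou(track_boxes[i], det_boxes[j])
            cost[i, j] = w_app * app + w_mot * mot
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```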
EIRN: A Method for Emotion Recognition Based on Micro-Expressions
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-09-26 | DOI: 10.1049/cvi2.70044
Genlang Chen, Han Zhou, Yufeng Chen, Jiajian Zhang, Wenwen Shen
Abstract: Micro-expressions are involuntary facial movements that reveal a person's true emotions when they attempt to conceal them. These expressions hold significant potential for various applications. However, because of their brief duration and subtle manifestation, detailed features are often obscured by redundant information, making micro-expression recognition challenging. Previous studies have primarily relied on convolutional neural networks (CNNs) to process high-resolution images or optical-flow features, but the complexity of deep networks often introduces redundancy and leads to overfitting. In this paper, we propose EIRN, a novel method for micro-expression recognition. Unlike conventional approaches, EIRN explicitly separates facial features of different granularities, using shallow networks to extract sparse features from low-resolution greyscale images while treating onset–apex pairs as Siamese samples and employing a Siamese neural network (SNN) to extract dense features from their high-resolution counterparts. These multigranularity features are then integrated for accurate classification. To mitigate overfitting in fine-grained feature extraction by the SNN, we introduce an attention module tailored to enhance crucial feature representation from both onset and apex frames during training. Experimental results on single and composite datasets demonstrate the effectiveness of our approach and its potential for real-world applications.
Citations: 0
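As a rough illustration of the onset–apex Siamese idea, the sketch below encodes both frames with a single shared-weight CNN and combines the two embeddings before classification. It is a minimal sketch with assumed layer sizes, not the published EIRN architecture, which additionally separates sparse low-resolution and dense high-resolution feature branches and adds an attention module.

```python
# Minimal Siamese onset-apex sketch (assumptions throughout, not EIRN itself).
import torch
import torch.nn as nn

class SiameseOnsetApex(nn.Module):
    def __init__(self, num_classes=5, dim=64):
        super().__init__()
        # Shallow encoder for greyscale inputs (1 channel); weights are shared
        # between the onset and apex branches.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(dim * 3, num_classes)

    def forward(self, onset, apex):
        f_on, f_ap = self.encoder(onset), self.encoder(apex)   # shared weights
        # Concatenate both embeddings with their absolute difference.
        fused = torch.cat([f_on, f_ap, (f_on - f_ap).abs()], dim=1)
        return self.head(fused)

logits = SiameseOnsetApex()(torch.randn(2, 1, 112, 112), torch.randn(2, 1, 112, 112))
```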
Frequency Domain Adaptive Filters in Vision Transformers for Small-Scale Datasets
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-09-18 | DOI: 10.1049/cvi2.70043
Oscar Ondeng, Peter Akuon, Heywood Ouma
Abstract: Transformers have achieved remarkable success in computer vision, but their reliance on self-attention poses challenges for small-scale datasets due to high computational demands and data requirements. This paper introduces the Multi-Head Adaptive Filter Frequency Vision Transformer (MAF-FViT), a Vision Transformer that replaces self-attention with frequency-domain adaptive filters. MAF-FViT leverages multi-head adaptive filtering in the frequency domain to capture essential features with reduced computational complexity, providing an efficient alternative for vision tasks on limited data. Training is carried out from scratch without pretraining on large-scale datasets. The proposed MAF-FViT model demonstrates strong performance on various image classification tasks, achieving competitive accuracy with a lower parameter count and faster processing times than self-attention-based models and other models employing alternative token mixers. The multi-head adaptive filters enable the model to capture complex image features effectively, preserving high classification accuracy while minimising computational load. The results show that frequency-domain adaptive filters offer an effective alternative to self-attention, enabling competitive performance on small-scale datasets while reducing training time and memory requirements. MAF-FViT opens avenues for resource-efficient transformer models in vision applications, making it a promising solution for settings constrained by data or computational resources.
Citations: 0
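The frequency-domain token mixer can be sketched as follows: the token grid is transformed with a 2D FFT, multiplied element-wise by a learnable complex filter split across several heads, and transformed back. The head layout, grid size and initialisation below are assumptions for illustration, not the published MAF-FViT configuration.

```python
# Hedged sketch of a multi-head frequency-domain filter used as a token mixer.
import torch
import torch.nn as nn

class FrequencyFilterMixer(nn.Module):
    """FFT over the H x W token grid, multiply by a learnable complex filter
    per head and channel, then inverse FFT (a self-attention replacement)."""
    def __init__(self, dim, h=14, w=14, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.h, self.w, self.heads = h, w, heads
        # Complex filter stored as (real, imag) pairs per frequency bin.
        self.filter = nn.Parameter(
            torch.randn(heads, dim // heads, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                      # x: (B, H*W, C)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, self.heads, c // self.heads, self.h, self.w)
        x_f = torch.fft.rfft2(x, dim=(-2, -1), norm="ortho")
        x_f = x_f * torch.view_as_complex(self.filter)
        x = torch.fft.irfft2(x_f, s=(self.h, self.w), dim=(-2, -1), norm="ortho")
        return x.reshape(b, c, n).transpose(1, 2)

tokens = torch.randn(2, 14 * 14, 64)
mixed = FrequencyFilterMixer(dim=64)(tokens)   # same shape as the input
```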
Large Language Model-Based Spatio-Temporal Semantic Enhancement for Skeleton Action Understanding
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-09-14 | DOI: 10.1049/cvi2.70041
Ran Wei, Hui Jie Zhang, Chang Cao, Fang Zhang, Jun Ling Gao, Xiao Tian Li, Lei Geng
Abstract: Skeleton-based temporal action segmentation aims to segment and classify human actions in untrimmed skeletal sequences. Existing methods struggle to distinguish transition poses between adjacent frames and fail to adequately capture semantic dependencies between joints and actions. To address these challenges, we propose large language model-based spatio-temporal semantic enhancement (LLM-STSE), a novel framework that combines adaptive spatio-temporal axial attention (ASTA-Attention) and dynamic semantic-guided multimodal action segmentation (DSG-MAS). ASTA-Attention models spatial and temporal dependencies using axial attention, whereas DSG-MAS dynamically generates semantic prompts based on joint motion and fuses them with skeleton features for more accurate segmentation. Experiments on the MCFS and PKU-MMD datasets show that LLM-STSE achieves state-of-the-art performance, significantly improving action segmentation, especially for complex transitions, with substantial F1-score gains across multiple public datasets.
Citations: 0
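The axial-attention component can be illustrated with a small sketch that attends along the temporal axis for each joint and then along the joint axis for each frame. This is a generic axial-attention layer with assumed dimensions, not the authors' ASTA-Attention module or the LLM prompt branch.

```python
# Generic spatio-temporal axial attention over a skeleton sequence.
import torch
import torch.nn as nn

class AxialSkeletonAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, T, J, C)
        b, t, j, c = x.shape
        # Temporal axis: each joint is an independent sequence of length T.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, c)
        xt, _ = self.temporal(xt, xt, xt)
        x = xt.reshape(b, j, t, c).permute(0, 2, 1, 3)
        # Spatial axis: each frame is a sequence of J joints.
        xs = x.reshape(b * t, j, c)
        xs, _ = self.spatial(xs, xs, xs)
        return xs.reshape(b, t, j, c)

out = AxialSkeletonAttention()(torch.randn(2, 100, 25, 64))   # 100 frames, 25 joints
```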
Enhancing Interpretability of NesT Model Using NesT-Shapley and Feature-Weight-Augmentation Method
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-09-06 | DOI: 10.1049/cvi2.70039
Li Xu, Lei Li, Xiaohong Cong, Huijie Song
Abstract: The transformer's capabilities in natural language processing and computer vision are impressive, but interpretability is crucial in specific domain applications. The NesT model, with its pyramidal structure, achieves high accuracy and fast training. Unlike other models, NesT avoids the [CLS] token, which complicates interpretability methods that rely on the model's internal structure; instead, it divides the image into 16 blocks and processes them with 16 independent vision transformers. We propose the NesT-Shapley method, which exploits this structure to combine the Shapley value method (a self-interpretable approach) with the independently operating vision transformers within NesT, significantly reducing computational complexity. We also introduce the feature weight augmentation (FWA) method to address the challenge of weight adjustment in the final interpretability results produced by interpretability methods that lack a [CLS] token, markedly enhancing the performance of interpretability methods and providing a better understanding of the information flow during prediction in the NesT model. We conducted perturbation experiments on the NesT model using the ImageNet and CIFAR-100 datasets and segmentation experiments on the ImageNet-Segmentation dataset, achieving strong results.
Citations: 0
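Block-level Shapley attribution of the kind described above can be approximated generically by sampling permutations of the 16 image blocks and measuring how the target logit changes as blocks are revealed. The sketch below is this generic Monte-Carlo estimator with an assumed 4 x 4 grid and zero baseline; the actual NesT-Shapley method instead exploits NesT's independent per-block transformers to avoid this sampling cost.

```python
# Generic Monte-Carlo Shapley attribution over image blocks (illustration only).
import torch

def block_shapley(model, image, target, grid=4, samples=200, baseline=0.0):
    """Estimate a Shapley value per block for model(image)[0, target].
    image: (1, C, H, W); blocks outside a coalition are set to `baseline`."""
    _, c, h, w = image.shape
    bh, bw = h // grid, w // grid
    n = grid * grid
    values = torch.zeros(n)

    def masked_score(coalition):
        x = torch.full_like(image, baseline)
        for k in coalition:
            r, col = divmod(k, grid)
            x[..., r * bh:(r + 1) * bh, col * bw:(col + 1) * bw] = \
                image[..., r * bh:(r + 1) * bh, col * bw:(col + 1) * bw]
        with torch.no_grad():
            return model(x)[0, target].item()

    for _ in range(samples):
        order = torch.randperm(n).tolist()
        coalition, prev = [], masked_score([])
        for k in order:
            coalition.append(k)
            cur = masked_score(coalition)
            values[k] += cur - prev      # marginal contribution of block k
            prev = cur
    return values / samples
```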
Self-Prompting Segment Anything Model for Few-Shot Medical Image Segmentation
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-08-23 | DOI: 10.1049/cvi2.70040
Haifeng Zhao, Weichen Liu, Leilei Ma, Zaipeng Xie
Abstract: Segmenting unlabelled medical images with a minimal amount of labelled data is a daunting task due to the complex feature landscapes and the noise and artefacts characteristic of medical imaging. The Segment Anything Model (SAM) has showcased the potential of large-scale image segmentation models to achieve zero-shot generalisation to previously unseen objects. However, directly applying SAM to medical image segmentation without incorporating prior knowledge of the target task can lead to unsatisfactory results. To address this, we enhance SAM by integrating prior knowledge of medical image segmentation tasks, enabling it to adapt quickly to few-shot medical image segmentation while keeping parameter training efficient. Our method employs an ensemble learning strategy to train a simple classifier that produces a coarse mask for each test image. Importantly, this coarse mask yields more accurate prompt points and boxes, improving SAM's capacity for prompt-driven segmentation. To refine SAM's ability to produce more precise masks, we introduce the Isolated Noise Removal (INR) module, which efficiently removes noise from the coarse masks, and the Multi-point Automatic Prompt (MPAP) module, which independently generates multiple effective and evenly distributed point prompts based on these coarse masks. We also introduce a new knee-joint dataset benchmark for medical image segmentation. Extensive evaluations on three benchmark datasets confirm the superior performance of our approach compared to existing methods, demonstrating its efficacy and significant progress in few-shot medical image segmentation.
Citations: 0
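The step of turning a coarse mask into prompts can be sketched generically: find connected components, drop tiny blobs (a crude stand-in for isolated-noise removal), and emit one box plus a few evenly spread positive points per region. Function names, thresholds and the point-spreading rule below are assumptions, not the paper's INR or MPAP modules.

```python
# Hedged sketch: derive SAM-style box and point prompts from a coarse mask.
import numpy as np
from scipy import ndimage

def mask_to_prompts(coarse_mask, points_per_region=3, min_area=50):
    """coarse_mask: (H, W) bool array. Returns (boxes, points) lists."""
    labels, n = ndimage.label(coarse_mask)
    boxes, points = [], []
    for lab in range(1, n + 1):
        ys, xs = np.where(labels == lab)
        if ys.size < min_area:          # drop isolated noise blobs
            continue
        boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])   # x1, y1, x2, y2
        # Spread a few positive point prompts over the region.
        idx = np.linspace(0, ys.size - 1, points_per_region).astype(int)
        points.extend([[int(xs[i]), int(ys[i])] for i in idx])
    return boxes, points

mask = np.zeros((128, 128), dtype=bool)
mask[30:70, 40:90] = True
boxes, points = mask_to_prompts(mask)
```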
Towards More Generalisable Compositional Feature Learning in Human-Object Interaction Detection
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-08-11 | DOI: 10.1049/cvi2.70037
Shuang Liang, Zikun Zhuang, Chi Xie, Shuwei Yan, Hongming Zhu
Abstract: The long-tailed distribution of training samples is a fundamental challenge in human-object interaction (HOI) detection, leading to extremely imbalanced performance on non-rare and rare classes. Existing works adopt the idea of compositional learning, in which object and action features are learnt individually and re-composed into new samples of rare HOI classes. However, most of these methods are built on traditional CNN-based frameworks, which are weak at capturing image-wide context, and their simple feature integration mechanisms fail to aggregate effective semantics in the re-composed features. As a result, these methods achieve only limited improvements in knowledge generalisation. We propose a novel transformer-based compositional learning framework for HOI detection. Human-object pair features and interaction features containing rich global context are extracted and comprehensively integrated via a cross-attention mechanism, generating re-composed features with more generalisable semantics. To further improve the re-composed features and promote knowledge generalisation, we leverage the vision-language model CLIP in a computation-efficient manner to improve re-composition sampling and guide interaction feature learning. Experiments on two benchmark datasets prove the effectiveness of our method in improving performance on both rare and non-rare HOI classes.
Citations: 0
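The cross-attention integration of re-composed pair features with global interaction features can be sketched as a single cross-attention block in which pair features act as queries and interaction features as keys and values. Dimensions and the residual/normalisation layout below are assumptions rather than the paper's exact design.

```python
# Minimal cross-attention fusion of pair features with interaction context.
import torch
import torch.nn as nn

class PairContextFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pair_feats, interaction_feats):
        # pair_feats: (B, P, C) re-composed human-object pairs (queries)
        # interaction_feats: (B, K, C) image-wide interaction context (keys/values)
        attended, _ = self.cross_attn(pair_feats, interaction_feats, interaction_feats)
        return self.norm(pair_feats + attended)

fused = PairContextFusion()(torch.randn(2, 10, 256), torch.randn(2, 100, 256))
```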
MVDT: Multiview Distillation Transformer for View-Invariant Sign Language Translation
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-07-31 | DOI: 10.1049/cvi2.70038
Zhong Guan, Yongli Hu, Huajie Jiang, Yanfeng Sun, Baocai Yin
Abstract: Sign language translation based on machine learning plays a crucial role in facilitating communication between deaf and hearing individuals. However, due to the complexity and variability of sign language, coupled with limited observation angles, single-view sign language translation models often underperform in real-world applications. Although some studies have attempted to improve translation by incorporating multiview data, challenges such as feature alignment and fusion, as well as the high cost of capturing multiview data, remain significant barriers in many practical scenarios. To address these issues, we propose a multiview distillation transformer (MVDT) for continuous sign language translation. MVDT introduces a novel distillation mechanism in which a teacher model learns common features from multiview data and then guides a student model to extract view-invariant features from single-view input alone. To evaluate the proposed method, we construct a multiview sign language dataset comprising five distinct views and conduct extensive experiments comparing MVDT with state-of-the-art methods. Experimental results demonstrate that the proposed model exhibits superior view-invariant translation capabilities across different views.
Citations: 0
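The distillation objective can be sketched as a feature-matching term added to the usual translation loss: the frozen teacher's multiview features are pooled into a common representation and the single-view student is pulled towards it. The mean-pooling, cosine loss and weighting below are assumptions, not the MVDT implementation.

```python
# Sketch of a multiview-to-single-view feature distillation objective.
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_view_feats, translation_loss, alpha=0.5):
    """student_feat: (B, T, C) features from one view.
    teacher_view_feats: (B, V, T, C) features from V views (teacher is frozen)."""
    teacher_common = teacher_view_feats.mean(dim=1).detach()   # average over views
    feat_loss = 1.0 - F.cosine_similarity(student_feat, teacher_common, dim=-1).mean()
    return translation_loss + alpha * feat_loss
```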
Adaptive Multiscale Attention Feature Aggregation for Multi-Modal 3D Occluded Object Detection
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-07-17 | DOI: 10.1049/cvi2.70035
Yanfeng Han, Ming Yu, Jing Liu
Abstract: Accurate perception and understanding of the three-dimensional environment is crucial for autonomous vehicles to navigate efficiently and make sound decisions. However, in complex real-world scenarios, the information obtained by a single-modal sensor is often incomplete, severely affecting the detection accuracy of occluded targets. To address this issue, this paper proposes a novel adaptive multi-scale attention aggregation strategy that efficiently fuses multi-scale feature representations of heterogeneous data to accurately capture the shape details and spatial relationships of targets in three-dimensional space. The strategy utilises learnable sparse keypoints to dynamically align heterogeneous features in a data-driven manner, adaptively modelling the cross-modal mapping between keypoints and their corresponding multi-scale image features. Because accurate three-dimensional shape information is essential for understanding the size and rotation pose of occluded targets, this paper adopts a shape-prior-based constraint method and a data augmentation strategy to guide the model to perceive the complete three-dimensional shape and rotation pose of occluded targets more accurately. Experimental results show that the proposed model improves the 3D R40 mAP score by 2.15%, 3.24% and 2.75% under the easy, moderate and hard difficulty levels, respectively, compared to MVXNet, significantly enhancing the detection accuracy and robustness for occluded targets in complex scenarios.
Citations: 0
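Gathering multi-scale image features at learnable sparse keypoints can be sketched with bilinear sampling at normalised 2D positions followed by a softmax weighting over scales. The sketch omits the 3D-to-image projection, and all module names are assumptions, not the paper's aggregation strategy.

```python
# Hedged sketch: sample multi-scale image features at learnable keypoints.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointMultiScaleSampler(nn.Module):
    def __init__(self, num_keypoints=32, num_scales=3, dim=64):
        super().__init__()
        # Learnable 2D keypoints in normalised [-1, 1] image coordinates.
        self.keypoints = nn.Parameter(torch.rand(num_keypoints, 2) * 2 - 1)
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))

    def forward(self, feature_maps):               # list of (B, C, Hi, Wi)
        b = feature_maps[0].shape[0]
        grid = self.keypoints.view(1, 1, -1, 2).expand(b, -1, -1, -1)
        sampled = [F.grid_sample(fm, grid, align_corners=False)  # (B, C, 1, K)
                   for fm in feature_maps]
        sampled = torch.stack(sampled, dim=0)       # (S, B, C, 1, K)
        weights = torch.softmax(self.scale_logits, dim=0).view(-1, 1, 1, 1, 1)
        return (weights * sampled).sum(dim=0).squeeze(2).transpose(1, 2)  # (B, K, C)

maps = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
keypoint_feats = KeypointMultiScaleSampler()(maps)
```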
SIM-Net: A Multimodal Fusion Network Using Inferred 3D Object Shape Point Clouds From RGB Images for 2D Classification
IF 1.3 | CAS Quartile 4, Computer Science
IET Computer Vision | Pub Date: 2025-07-09 | DOI: 10.1049/cvi2.70036
Youcef Sklab, Hanane Ariouat, Eric Chenin, Edi Prifti, Jean-Daniel Zucker
Abstract: We introduce the shape-image multimodal network (SIM-Net), a novel 2D image classification architecture that integrates 3D point cloud representations inferred directly from RGB images. Our key contribution is a pixel-to-point transformation that converts 2D object masks into 3D point clouds, enabling the fusion of texture-based and geometric features for enhanced classification performance. SIM-Net is particularly well suited to classifying digitised herbarium specimens, a task made challenging by heterogeneous backgrounds, non-plant elements and occlusions that compromise conventional image-based models. To address these issues, SIM-Net employs a segmentation-based preprocessing step to extract object masks prior to 3D point cloud generation. The architecture comprises a CNN encoder for 2D image features and a PointNet-based encoder for geometric features, which are fused into a unified latent space. Experimental evaluations on herbarium datasets demonstrate that SIM-Net consistently outperforms ResNet101, achieving gains of up to 9.9% in accuracy and 12.3% in F-score. It also surpasses several transformer-based state-of-the-art architectures, highlighting the benefits of incorporating 3D structural reasoning into 2D image classification tasks.
Citations: 0
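A very rough version of the pixel-to-point idea is sketched below: foreground mask pixels are lifted to (x, y, z) points, with z taken here from a distance transform as a crude placeholder for the depth that SIM-Net actually infers. The sampling scheme and normalisation are assumptions; the output would then feed a PointNet-style encoder alongside the CNN branch.

```python
# Rough pixel-to-point lifting sketch (not SIM-Net's shape inference).
import numpy as np
from scipy import ndimage

def mask_to_point_cloud(mask, num_points=1024, seed=0):
    """mask: (H, W) bool array of the segmented object. Returns (num_points, 3)."""
    h, w = mask.shape
    depth = ndimage.distance_transform_edt(mask)   # pseudo-depth inside the object
    ys, xs = np.nonzero(mask)
    rng = np.random.default_rng(seed)
    idx = rng.choice(ys.size, size=num_points, replace=ys.size < num_points)
    pts = np.stack([xs[idx] / w, ys[idx] / h,
                    depth[ys[idx], xs[idx]] / (depth.max() + 1e-9)], axis=1)
    return pts.astype(np.float32)

mask = np.zeros((256, 256), dtype=bool)
mask[60:200, 80:180] = True
cloud = mask_to_point_cloud(mask)        # feed to a PointNet-style encoder
```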