IET Computer Vision: Latest Publications

Enhancing semi-supervised contrastive learning through saliency map for diabetic retinopathy grading
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-08-26 · DOI: 10.1049/cvi2.12308
Jiacheng Zhang, Rong Jin, Wenqiang Liu

Abstract: Diabetic retinopathy (DR) is a severe ophthalmic condition that can lead to blindness if not diagnosed and treated in a timely manner. Hence, the development of efficient automated DR grading systems is crucial for early screening and treatment. Although progress has been made in DR detection using deep learning techniques, these methods still face challenges in handling the complexity of DR lesion characteristics and the nuances in grading criteria. Moreover, the performance of these algorithms is hampered by the scarcity of large-scale, high-quality annotated data. An innovative semi-supervised fundus image DR grading framework is proposed, employing a saliency estimation map to bolster the model's perception of fundus structures, thereby improving the differentiation between lesions and healthy regions. By integrating semi-supervised and contrastive learning, the model's ability to recognise inter-class and intra-class variations in DR grading is enhanced, allowing for precise discrimination of various lesion features. Experiments conducted on publicly available DR grading datasets, such as EyePACS and Messidor, validate the effectiveness of the proposed method. Specifically, the approach outperforms the state of the art on the kappa metric by 0.8% on the full EyePACS dataset and by 3.2% on a 10% subset of EyePACS, demonstrating its superiority over previous methodologies. The authors' code is publicly available at https://github.com/500ZhangJC/SCL-SEM-framework-for-DR-Grading.

Volume 18, Issue 8, pp. 1127-1137. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12308
Citations: 0
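The abstract describes combining semi-supervised contrastive learning with a saliency estimation map but gives no implementation details. The sketch below shows one plausible reading, assuming a standard NT-Xent (InfoNCE) contrastive loss in which a saliency-masked view of the fundus image serves as the second augmentation; the function names, the temperature value, and the `encoder` are placeholders, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Standard NT-Xent contrastive loss between two batches of embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                            # (2N, D)
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                # drop self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                      # positive of sample i sits N places away

def saliency_view(images, saliency):
    """Hypothetical augmentation: emphasise salient fundus structures in the second view."""
    return images * saliency                                  # (N, C, H, W) * (N, 1, H, W)

# Usage sketch, with `encoder` any CNN that maps an image batch to embeddings:
# loss = nt_xent_loss(encoder(images), encoder(saliency_view(images, saliency)))
```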
Balanced parametric body prior for implicit clothed human reconstruction from a monocular RGB
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-08-25 · DOI: 10.1049/cvi2.12306
Rong Xue, Jiefeng Li, Cewu Lu

Abstract: The authors study the problem of reconstructing detailed 3D human surfaces in various poses and clothing from images. The parametric human body allows accurate 3D clothed human reconstruction. However, the offset of large and loose clothing from the inferred parametric body mesh confines the generalisation of existing parametric-body-based methods. A distinctive method that simultaneously generalises well to unseen poses and unseen clothing is proposed. The authors first discover the unbalanced nature of existing implicit-function-based methods. To address this issue, they propose to synthesise balanced training samples with a new dependency coefficient during training. The dependency coefficient tells the network whether the prior from the parametric body model is reliable. The authors then design a novel positional-embedding-based attenuation strategy to incorporate the dependency coefficient into the implicit function (IF) network. Comprehensive experiments are conducted on the CAPE dataset to study the effectiveness of the approach. The proposed method significantly surpasses state-of-the-art approaches and generalises well to unseen poses and clothing. As an illustrative example, the proposed method improves the Chamfer Distance Error and Normal Error by 38.2% and 57.6%.

Volume 18, Issue 7, pp. 1057-1067. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12306
Citations: 0
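The abstract mentions a positional-embedding-based attenuation strategy that injects the dependency coefficient into the implicit-function network, without specifying its exact form. A minimal sketch of one possible mechanism follows: the sinusoidal encoding of a per-point body-prior feature is scaled by the coefficient, so the prior is down-weighted wherever it is judged unreliable. The encoding choice and every name here are assumptions for illustration.

```python
import torch

def positional_encoding(x, num_freqs=6):
    """Sinusoidal encoding of a per-point scalar feature (e.g. signed distance to the body mesh)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs                          # (N, num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def attenuated_body_prior(body_feat, dependency_coeff, num_freqs=6):
    """Scale the encoded parametric-body prior by a per-point reliability coefficient in [0, 1]."""
    pe = positional_encoding(body_feat, num_freqs)            # (N, 2 * num_freqs)
    return pe * dependency_coeff.unsqueeze(-1)                # attenuated feature for the IF network
```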
Social-ATPGNN: Prediction of multi-modal pedestrian trajectory of non-homogeneous social interaction
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-08-21 · DOI: 10.1049/cvi2.12286
Kehao Wang, Han Zou

Abstract: With the development of automatic driving and path-planning technology, predicting the moving trajectories of pedestrians in dynamic scenes has become a key and urgent technical problem. However, most existing techniques regard all pedestrians in the scene as having an equally important influence on the predicted pedestrian's trajectory, and methods that use sequence-based time-series generative models to obtain predicted trajectories do not allow for parallel computation, which introduces significant computational overhead. A new social trajectory prediction network, Social-ATPGNN, which integrates both temporal and spatial information based on ATPGNN, is proposed. In the spatial domain, the pedestrians in the predicted scene are formed into an undirected, non-fully-connected graph, which resolves the homogenisation of pedestrian relationships; the spatial interaction between pedestrians is then encoded to improve the accuracy of modelling pedestrian social awareness. After acquiring high-level spatial features, the method uses a Temporal Convolutional Network, which can perform parallel calculations, to capture the correlations in the time series of pedestrian trajectories. Extensive experiments show that the proposed model outperforms the latest models on various pedestrian trajectory datasets.

Volume 18, Issue 7, pp. 907-921. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12286
Citations: 0
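The parallelism claimed for the temporal part of Social-ATPGNN comes from replacing recurrent decoding with temporal convolutions. Below is a minimal dilated causal convolution block of the kind typically used in Temporal Convolutional Networks; it is a generic sketch, not the authors' architecture, and the channel and kernel sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """Dilated causal 1-D convolution over per-pedestrian trajectory features."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation          # pad only the past side
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                     # x: (batch, channels, time)
        out = F.pad(x, (self.left_pad, 0))
        out = torch.relu(self.conv(out))
        return out + x                                        # residual connection

# All time steps are processed in one convolution pass, unlike step-by-step recurrent decoding:
# tcn = nn.Sequential(TemporalBlock(64, dilation=1), TemporalBlock(64, dilation=2))
```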
HIST: Hierarchical and sequential transformer for image captioning
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-08-15 · DOI: 10.1049/cvi2.12305
Feixiao Lv, Rui Wang, Lihua Jing, Pengwen Dai

Abstract: Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder transformer framework. Such transformer structures, however, show two main limitations in the task of image captioning. Firstly, the traditional transformer obtains high-level fusion features to decode while ignoring other-level features, resulting in losses of image content. Secondly, the transformer is weak in modelling the natural order characteristics of language. To address these issues, the authors propose a HIerarchical and Sequential Transformer (HIST) structure, which forces each layer of the encoder and decoder to focus on features of different granularities and strengthens the sequential semantic information. Specifically, to capture the details of different levels of features in the image, the authors combine the visual features of multiple regions and divide them into multiple levels. In addition, to enhance the sequential information, a sequential enhancement module in each decoder layer block extracts different levels of features for sequential semantic extraction and expression. Extensive experiments on the public MS-COCO and Flickr30k datasets demonstrate the effectiveness of the proposed method and show that it outperforms most previous state-of-the-art approaches.

Volume 18, Issue 7, pp. 1043-1056. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12305
Citations: 0
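"Features of different granularities" for the encoder layers can be produced in several ways; one simple, hypothetical realisation is to pool the region features at a few scales and route each scale to a different layer, as sketched below. The level sizes and pooling choice are illustrative only, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def multi_granularity_levels(region_feats, levels=(1, 2, 4)):
    """Build coarse-to-fine views of detected-region features.
    region_feats: (batch, num_regions, dim); returns one tensor per granularity level."""
    outputs = []
    for num_groups in levels:
        pooled = F.adaptive_avg_pool1d(region_feats.transpose(1, 2), num_groups)
        outputs.append(pooled.transpose(1, 2))                # (batch, num_groups, dim)
    return outputs

# Each returned level could then be attended to by a different encoder layer of the captioner.
```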
Multi-modal video search by examples—A video quality impact analysis
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-27 · DOI: 10.1049/cvi2.12303
Guanfeng Wu, Abbas Haider, Xing Tian, Erfan Loweimi, Chi Ho Chan, Mengjie Qian, Awan Muhammad, Ivor Spence, Rob Cooper, Wing W. Y. Ng, Josef Kittler, Mark Gales, Hui Wang

Abstract: As video content continues to proliferate and many video archives lack suitable metadata, video retrieval, particularly through example-based search, has become increasingly crucial. Existing metadata often fails to meet the needs of specific types of searches, especially when videos contain elements from different modalities, such as visual and audio. Consequently, developing video retrieval methods that can handle multi-modal content is essential. An innovative Multi-modal Video Search by Examples (MVSE) framework is introduced, employing state-of-the-art techniques in its various components. In designing MVSE, the authors focused on accuracy, efficiency, interactivity, and extensibility, with key components including advanced data processing and a user-friendly interface aimed at enhancing search effectiveness and user experience. Furthermore, the framework was comprehensively evaluated, assessing individual components, data quality issues, and overall retrieval performance using high-quality and low-quality BBC archive videos. The evaluation reveals that: (1) multi-modal search yields better results than single-modal search; (2) video quality, both visual and audio, affects query precision, and compared with image queries, audio quality has a greater impact on query precision; (3) a two-stage search process (searching by Hamming distance based on hashing, followed by searching by cosine similarity based on embeddings) is effective but increases time overhead; (4) large-scale video retrieval is not only feasible but also expected to emerge shortly.

Volume 18, Issue 7, pp. 1017-1033. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12303
Citations: 0
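Finding (3) describes a two-stage search: a coarse shortlist by Hamming distance over binary hashes, then re-ranking by cosine similarity over real-valued embeddings. A small sketch of that pipeline follows; the hash and embedding arrays stand in for whatever the framework's encoders produce, and the shortlist sizes are arbitrary.

```python
import numpy as np

def two_stage_search(query_hash, query_emb, db_hashes, db_embs, shortlist=100, top_k=10):
    """Stage 1: Hamming distance on 0/1 hash codes; stage 2: cosine re-ranking of the shortlist."""
    hamming = np.count_nonzero(db_hashes != query_hash, axis=1)
    candidates = np.argsort(hamming)[:shortlist]              # cheap coarse filtering

    cand = db_embs[candidates]
    sims = cand @ query_emb / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    order = np.argsort(-sims)[:top_k]                         # precise but costlier re-ranking
    return candidates[order], sims[order]

# db_hashes: (N, B) binary codes; db_embs: (N, D) float embeddings for the archive items.
```

The extra embedding comparison over the shortlist is where the reported time overhead of the two-stage process comes from.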
2D human skeleton action recognition with spatial constraints
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-11 · DOI: 10.1049/cvi2.12296
Lei Wang, Jianwei Zhang, Wenbing Yang, Song Gu, Shanmin Yang

Abstract: Human actions are predominantly presented in 2D format in video surveillance scenarios, which hinders the accurate determination of action details not apparent in 2D data. Depth estimation can aid human action recognition tasks, enhancing accuracy with neural networks. However, relying on images for depth estimation requires extensive computational resources and cannot exploit the connectivity between human body structures. Besides, the estimated depth may not accurately reflect actual depth ranges, necessitating improved reliability. Therefore, a 2D human skeleton action recognition method with spatial constraints (2D-SCHAR) is introduced. 2D-SCHAR employs graph convolution networks to process graph-structured human skeleton action data and comprises three parts: depth estimation, spatial transformation, and action recognition. The first two components, which infer 3D information from 2D human skeleton actions and generate spatial transformation parameters to correct abnormal deviations in action data, support the action recognition component and enhance its accuracy. The model is designed in an end-to-end, multitasking manner, allowing parameter sharing among the three components to boost performance. The experimental results validate the model's effectiveness and superiority in human skeleton action recognition.

Volume 18, Issue 7, pp. 968-981. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12296
Citations: 0
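The end-to-end multitask design can be pictured as one shared skeleton encoder feeding three heads (per-joint depth, spatial-transformation parameters, action logits) trained under a weighted joint loss. The sketch below uses a plain MLP encoder as a stand-in for the paper's graph convolutional backbone; all layer sizes and the 2-D affine parameterisation are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskSkeletonNet(nn.Module):
    """Shared encoder with three task heads, trained end to end with parameter sharing."""
    def __init__(self, in_dim, hidden=128, num_joints=17, num_classes=60):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.depth_head = nn.Linear(hidden, num_joints)       # per-joint depth estimate
        self.transform_head = nn.Linear(hidden, 6)            # 2-D affine correction parameters
        self.action_head = nn.Linear(hidden, num_classes)     # action logits

    def forward(self, x):                                     # x: flattened 2-D skeleton features
        h = self.encoder(x)
        return self.depth_head(h), self.transform_head(h), self.action_head(h)

# Joint objective: loss = w_d * depth_loss + w_t * transform_loss + w_a * action_loss,
# so gradients from all three tasks update the shared encoder.
```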
Centre-loss—A preferred class verification approach over sample-to-sample in self-checkout products datasets
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-11 · DOI: 10.1049/cvi2.12302
Bernardas Ciapas, Povilas Treigys

Abstract: Siamese networks excel at comparing two images, serving as an effective class verification technique when a single reference image per class is available. However, when multiple reference images are present, Siamese verification necessitates multiple comparisons and aggregation, which is often impractical at inference time. The Centre-Loss approach proposed in this research solves the class verification task more efficiently than sample-to-sample approaches, using a single forward pass during inference. Optimising a Centre-Loss function learns class centres and minimises intra-class distances in the latent space. The authors compared verification accuracy using Centre-Loss against aggregated Siamese verification when other hyperparameters (such as the neural network backbone and distance type) are the same. Experiments were performed to contrast the ubiquitous Euclidean distance against other distance types and to discover the optimal Centre-Loss layer, its size, and the Centre-Loss weight. In the optimal architecture, the Centre-Loss layer is connected to the penultimate layer, calculates Euclidean distance, and its size depends on the distance type. The Centre-Loss method was validated on the Self-Checkout Products and Fruits 360 image datasets. Its comparable accuracy and lower complexity make Centre-Loss a preferred approach over sample-to-sample methods for class verification when the number of reference images per class is high and inference speed matters, such as in self-checkouts.

Volume 18, Issue 7, pp. 1004-1016. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12302
Citations: 0
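The centre-loss idea itself is standard: learn one centre per class, penalise the squared distance between each embedding and its class centre during training, then verify at inference by comparing a single embedding against the stored centres. The sketch below follows that generic formulation rather than the authors' exact layer placement or distance settings.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Learnable class centres; penalises squared distance of each embedding to its own centre."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):                      # features: (N, D), labels: (N,)
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

def verify(features, centers, threshold):
    """Single forward pass: accept the nearest centre if it is within the threshold."""
    dists = torch.cdist(features, centers)                    # (N, num_classes) Euclidean distances
    min_dist, pred = dists.min(dim=1)
    return pred, min_dist <= threshold

# Training sketch: total_loss = cross_entropy + lambda_c * center_loss(embeddings, labels).
# At inference only one embedding per image is needed, versus one comparison per reference image
# in aggregated Siamese verification.
```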
GR-Former: Graph-reinforcement transformer for skeleton-based driver action recognition
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-10 · DOI: 10.1049/cvi2.12298
Zhuoyan Xu, Jingke Xu

Abstract: In in-vehicle driving scenarios, composite action recognition is crucial for improving safety and understanding the driver's intention. Due to spatial constraints and occlusion, the driver's range of motion is limited, resulting in similar action patterns that are difficult to differentiate. Additionally, collecting skeleton data that characterise the full human posture is difficult, posing further challenges for action recognition. To address these problems, a novel Graph-Reinforcement Transformer (GR-Former) model is proposed. Using limited skeleton data as input, it introduces graph structure information to directionally reinforce the effect of the self-attention mechanism and dynamically learns and aggregates features between joints at multiple levels, constructing a richer feature vector space that enhances expressiveness and recognition accuracy. On the Drive & Act dataset for composite action recognition, the method uses only human upper-body skeleton data and achieves state-of-the-art performance compared with existing methods. With complete human skeleton data, it also achieves excellent recognition accuracy on the NTU RGB+D and NTU RGB+D 120 datasets, demonstrating the strong generalisability of GR-Former. Overall, this work provides a new and effective solution for driver action recognition in in-vehicle scenarios.

Volume 18, Issue 7, pp. 982-991. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12298
Citations: 0
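"Introducing graph structure information to directionally reinforce the self-attention mechanism" can be read, in its simplest form, as adding an adjacency-derived bias to the attention logits so that physically connected joints attend to each other more strongly. The module below sketches that generic interpretation; it is not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphBiasedAttention(nn.Module):
    """Self-attention over skeleton joints with a learnable bias on adjacent joint pairs."""
    def __init__(self, dim, num_joints):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        self.graph_bias = nn.Parameter(torch.zeros(num_joints, num_joints))

    def forward(self, x, adjacency):                          # x: (batch, joints, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        logits = logits + adjacency * self.graph_bias         # reinforce physically connected joints
        return F.softmax(logits, dim=-1) @ v
```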
Multi-scale skeleton simplification graph convolutional network for skeleton-based action recognition
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-08 · DOI: 10.1049/cvi2.12300
Fan Zhang, Ding Chongyang, Kai Liu, Liu Hongjin

Abstract: Human action recognition based on graph convolutional networks (GCNs) is one of the hotspots in computer vision. However, previous methods generally rely on a handcrafted graph, which limits the effectiveness of the model in characterising the connections between indirectly connected joints; this leads to weakened connections when joints are separated by long distances. To address this issue, the authors propose a skeleton simplification method that reduces the number of joints and the distances between them by merging adjacent joints into simplified joints. A group convolutional block is devised to extract the internal features of the simplified joints. Additionally, the authors enhance the method by introducing multi-scale modelling, which maps inputs into sequences across various levels of simplification. Combined with spatial-temporal graph convolution, a multi-scale skeleton simplification GCN for skeleton-based action recognition (M3S-GCN) is proposed for fusing multi-scale skeleton sequences and modelling the connections between joints. Finally, M3S-GCN is evaluated on five benchmarks drawn from the NTU RGB+D 60 (C-Sub, C-View), NTU RGB+D 120 (X-Sub, X-Set) and NW-UCLA datasets. Experimental results show that M3S-GCN achieves state-of-the-art performance with accuracies of 93.0%, 97.0% and 91.2% on the C-Sub, C-View and X-Set benchmarks, which validates the effectiveness of the method.

Volume 18, Issue 7, pp. 992-1003. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12300
Citations: 0
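Merging adjacent joints into simplified joints can be expressed as pooling over predefined joint groups, with several group partitions giving the multiple scales. The helper below illustrates the merging step only; the example grouping is hypothetical, not the partition used in the paper.

```python
import torch

def simplify_skeleton(joints, groups):
    """Merge adjacent joints into simplified joints by averaging each group.
    joints: (batch, time, num_joints, coords); groups: list of joint-index lists."""
    merged = [joints[:, :, idx, :].mean(dim=2) for idx in groups]
    return torch.stack(merged, dim=2)                         # (batch, time, num_groups, coords)

# Hypothetical coarse partition of a 25-joint skeleton into 10 simplified joints:
# groups = [[0, 1, 20], [2, 3], [4, 5, 6], [7, 21, 22], [8, 9, 10], [11, 23, 24],
#           [12, 13], [14, 15], [16, 17], [18, 19]]
```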
Recognition of European mammals and birds in camera trap images using deep neural networks
IF 1.5 · CAS Zone 4 · Computer Science
IET Computer Vision · Pub Date: 2024-07-03 · DOI: 10.1049/cvi2.12294
Daniel Schneider, Kim Lindner, Markus Vogelbacher, Hicham Bellafkir, Nina Farwig, Bernd Freisleben

Abstract: Most machine learning methods for animal recognition in camera trap images are limited to mammal identification and group birds into a single class. Machine learning methods for visually discriminating birds, in turn, cannot discriminate between mammals and are not designed for camera trap images. The authors present deep neural network models that recognise both mammals and bird species in camera trap images. They train neural network models for species classification as well as for predicting the animal taxonomy, that is, genus, family, order, group, and class names. Different neural network architectures, including ResNet, EfficientNetV2, Vision Transformer, Swin Transformer, and ConvNeXt, are compared for these tasks. Furthermore, the authors investigate approaches to overcome various challenges associated with camera trap image analysis. The best species classification models achieve a mean average precision (mAP) of 97.91% on a validation data set and mAPs of 90.39% and 82.77% on test data sets recorded in forests in Germany and Poland, respectively. The best taxonomic classification models reach a validation mAP of 97.18% and mAPs of 94.23% and 79.92% on the two test data sets, respectively.

Volume 18, Issue 8, pp. 1162-1192. Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12294
Citations: 0
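Predicting the species together with the higher taxonomic ranks can be realised as one shared image backbone with a classification head per rank. The sketch below uses a torchvision ResNet-50 purely as a placeholder for the backbones compared in the paper, and the rank vocabulary sizes are invented.

```python
import torch
import torch.nn as nn
from torchvision import models

class TaxonomicClassifier(nn.Module):
    """Shared backbone with one classification head per taxonomic rank."""
    def __init__(self, rank_sizes):                           # e.g. {"species": 50, "genus": 35, ...}
        super().__init__()
        backbone = models.resnet50(weights=None)              # placeholder; pretrained weights would normally be used
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                           # keep the pooled feature vector
        self.backbone = backbone
        self.heads = nn.ModuleDict({rank: nn.Linear(feat_dim, n) for rank, n in rank_sizes.items()})

    def forward(self, images):                                # images: (batch, 3, H, W)
        feats = self.backbone(images)
        return {rank: head(feats) for rank, head in self.heads.items()}

# Training sketch: sum one cross-entropy loss per rank so the backbone learns all levels jointly.
```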