Image and Vision Computing: Latest Articles

FGS-NeRF: A fast glossy surface reconstruction method based on voxel and reflection directions
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-14 · DOI: 10.1016/j.imavis.2025.105455
Han Hong, Qing Ye, Keyun Xiong, Qing Tao, Yiqian Wan

Neural surface reconstruction technology has great potential for recovering 3D surfaces from multiview images. However, surface gloss can severely affect the reconstruction quality. Although existing methods address the issue of glossy surface reconstruction, achieving rapid reconstruction remains a challenge. While DVGO can achieve rapid scene geometry search, it tends to create numerous holes in glossy surfaces during the search process. To address this, we design a geometry search method based on SDF and reflection directions, employing a strategy called progressive voxel-MLP scaling to achieve accurate and efficient geometry searches for glossy scenes. To mitigate object edge artifacts caused by reflection directions, we use a simple loss function called sigmoid RGB loss, which helps reduce artifacts around objects during the early stages of training and promotes efficient surface convergence. In this work, we introduce the FGS-NeRF model, which uses a coarse-to-fine training method combined with reflection directions to achieve rapid reconstruction of glossy object surfaces based on voxel grids. The training time on a single RTX 4080 GPU is 20 min. Evaluations on the Shiny Blender and Smart Car datasets confirm that our model significantly improves the speed when compared with existing glossy object reconstruction methods while achieving accurate object surfaces. Code: https://github.com/yosugahhh/FGS-nerf.

Citations: 0
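The abstract does not define how the reflection directions are computed, but in reflection-aware NeRF variants they are conventionally the mirror reflection of the view direction about the surface normal. A minimal numpy sketch under that assumption:

```python
import numpy as np

def reflect(view_dir, normal):
    """Mirror-reflect view directions about surface normals: r = d - 2 (d.n) n."""
    d = view_dir / np.linalg.norm(view_dir, axis=-1, keepdims=True)
    n = normal / np.linalg.norm(normal, axis=-1, keepdims=True)
    return d - 2.0 * np.sum(d * n, axis=-1, keepdims=True) * n
```

A ray hitting a flat surface head-on, d = (0, 0, -1) against n = (0, 0, 1), reflects straight back as (0, 0, 1), while a grazing direction lying in the surface plane is unchanged.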
ESDA: Zero-shot semantic segmentation based on an embedding semantic space distribution adjustment strategy
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-13 · DOI: 10.1016/j.imavis.2025.105456
Jiaguang Li, Ying Wei, Wei Zhang, Chuyuan Wang

Recently, the CLIP model, which is pre-trained on large-scale vision-language data, has promoted the development of zero-shot recognition tasks. Some researchers apply CLIP to zero-shot semantic segmentation, but they often struggle to achieve satisfactory results. This is because this dense prediction task requires not only a precise understanding of semantics, but also a precise perception of different regions within one image. However, CLIP is trained on image-level vision-language data, resulting in ineffective perception of pixel-level regions. In this paper, we propose a new zero-shot semantic segmentation (ZS3) method based on an embedding semantic space distribution adjustment strategy (ESDA), which enables CLIP to accurately perceive both semantics and regions. This method inserts additional trainable blocks into the CLIP image encoder, enabling it to effectively perceive regions without losing semantic understanding. Besides, we design spatial distribution losses to guide the update of parameters of the trainable blocks, thereby further enhancing the regional characteristics of pixel-level image embeddings. In addition, previous methods only obtain semantic support through a text [CLS] token, which is far from sufficient for the dense prediction task. Therefore, we design a vision-language embedding interactor, which can obtain richer semantic support through the interaction between the entire text embedding and image embedding. It can also further enhance the semantic support and strengthen the image embedding. Extensive experiments on PASCAL-5^i and COCO-20^i prove the effectiveness of our method. Our method achieves new state-of-the-art results for zero-shot semantic segmentation and exceeds many few-shot semantic segmentation methods. Code is available at https://github.com/Jiaguang-NEU/ESDA.

Citations: 0
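The core obstacle ESDA targets, CLIP's image-level training versus pixel-level prediction, comes down to scoring every pixel embedding against every class text embedding. A generic sketch of that dense similarity step (the function name and temperature value are illustrative, not ESDA's):

```python
import numpy as np

def dense_clip_logits(pixel_emb, text_emb, tau=0.07):
    """Per-pixel class logits from CLIP-style embeddings.

    pixel_emb: (H, W, C) pixel-level image embeddings
    text_emb:  (K, C) class text embeddings
    Returns (H, W, K) cosine-similarity logits scaled by 1/tau.
    """
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return np.einsum('hwc,kc->hwk', p, t) / tau
```

Taking an argmax over the last axis yields a zero-shot segmentation map; ESDA's contribution is making the pixel embeddings fed into this step region-aware.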
Semantic consistency learning for unsupervised multi-modal person re-identification
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-13 · DOI: 10.1016/j.imavis.2025.105434
Yuxin Zhang, Zhu Teng, Baopeng Zhang

Unsupervised multi-modal person re-identification poses significant challenges due to the substantial modality gap and the absence of annotations. Although previous efforts have aimed to bridge this gap by establishing modality correspondences, their focus has been confined to feature- and image-level correspondences, neglecting full utilization of semantic information. To tackle these issues, we propose a Semantic Consistency Learning Network (SCLNet) for unsupervised multi-modal person re-identification. SCLNet first predicts pseudo-labels using a hierarchical clustering algorithm, which capitalizes on common semantics to perform mutual refinement across modalities and establishes cross-modality label correspondences based on semantic analysis. Besides, we design a cross-modality loss that utilizes contrastive learning to acquire modality-invariant features, effectively reducing the inter-modality gap and enhancing the robustness of the model. Furthermore, we construct a new multi-modality dataset named Subway-TM. This dataset not only encompasses visible and infrared modalities but also includes a depth modality, captured by three cameras across 266 identities, comprising 10,645 RGB images, 10,529 infrared images, and 10,529 depth images. To the best of our knowledge, this is the first person re-identification dataset with three modalities. We conduct extensive experiments on the widely used person re-identification datasets SYSU-MM01 and RegDB, along with our newly proposed multi-modal Subway-TM dataset. The experimental results show that our proposed method is promising compared to the current state-of-the-art methods.

Citations: 0
DiffusionLoc: A diffusion model-based framework for crowd localization
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-12 · DOI: 10.1016/j.imavis.2025.105439
Qi Zhang, Yuan Li, Yiran Liu, Yanzhao Zhou, Jianbin Jiao

Accurately locating individuals in dense crowds remains a challenging problem and is of significant importance for crowd analysis. Traditional methods, such as box-based and map-based approaches, often fail to achieve ideal accuracy in high-density scenarios. Point-based localization methods have recently shown promising results but generally rely on heuristic priors to address localization tasks. This reliance on priors can lead to unstable performance across diverse scenarios, especially in crowds with significant density variations, where the methods struggle to generalize effectively. In this work, we introduce a framework called DiffusionLoc built upon diffusion models, which directly generates target points from random noise, simplifying the pipeline of point-based methods. Moreover, we design a feature interpolation method, called Differential Attention-based Implicit Feature Interpolation (DF-IFI), which effectively mitigates the instability of noisy points while extracting their features. Extensive experiments show that DiffusionLoc demonstrates superior competitive performance and adapts flexibly to different scenarios by dynamically modifying the number of noisy points and iteration steps.

Citations: 0
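The pipeline the abstract describes, random noise in and target points out, is the iterative refinement of diffusion models applied to point coordinates. A toy deterministic sketch (the blending schedule and the plug-in denoiser are illustrative stand-ins, not the paper's trained network):

```python
import numpy as np

def refine_points(noisy_points, denoiser, steps=10):
    """Iteratively refine noisy 2D points toward a denoiser's clean-point estimate.

    `denoiser(x, t)` maps current points and a noise level t in (0, 1] to a
    prediction of the clean points; here it stands in for the trained network.
    """
    x = np.asarray(noisy_points, dtype=float)
    for i in range(steps, 0, -1):
        t = i / steps                           # anneal the noise level toward 0
        x = t * x + (1.0 - t) * denoiser(x, t)  # blend toward the prediction
    return x
```

With a denoiser that always answers with fixed head locations, the loop contracts any initial noise onto those points; varying `steps` and the number of input points at inference is what gives the method its claimed flexibility.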
Enhanced residual network for burst image super-resolution using simple base frame guidance
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-12 · DOI: 10.1016/j.imavis.2025.105444
Anderson Nogueira Cotrim, Gerson Barbosa, Cid Adinam Nogueira Santos, Helio Pedrini

Burst, or multi-frame, image super-resolution (MFSR) has emerged as a critical area in computer vision, aimed at reconstructing high-resolution images from low-resolution bursts. Unlike single-image super-resolution (SISR), which has been extensively studied, MFSR leverages information from multiple shifted frames to mitigate the ill-posed nature of SISR. The rapid advancement in the capabilities of handheld devices, including enhanced processing power and faster image capture rates, also adds to the relevance of this field. In our previous work, we proposed a simple yet effective deep learning method tailored for RAW images, called Simple Base Frame Burst (SBFBurst). This method, based on a residual convolutional architecture, demonstrated significant performance improvements by incorporating base frame guidance mechanisms such as skip frame connections and concatenation of the base frame alongside the network. Despite these promising outcomes, and given the limited investigation of MFSR compared to SISR, further extensions and experiments are clearly required to propel the field forward. In this paper, we extend our recent work on SBFBurst by conducting a comprehensive analysis of the method from various perspectives. Our primary contribution lies in adapting and testing the architecture to handle both RAW Bayer-pattern images and RGB images, allowing evaluation on the novel RealBSR-RGB dataset. Our experiments revealed that SBFBurst still consistently outperforms existing state-of-the-art approaches both quantitatively and qualitatively, even after the introduction of a new method, FBANet, for comparison. We also extended our experiments to assess the impact of architecture parameters, model generalization, and the model's capacity to leverage complementary information. These exploratory extensions may open new avenues for advances in this field. Our code and models are publicly available at https://github.com/AndersonCotrim/SBFBurst.

Citations: 0
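The base frame guidance SBFBurst relies on, a skip connection from the base frame plus its concatenation into the fusion path, can be caricatured in a few lines; the frame average below is a purely illustrative stand-in for the learned residual network:

```python
import numpy as np

def base_frame_guided_fusion(burst, base_idx=0, blend=0.5):
    """Fuse a burst of frames (N, H, W) with a skip connection from the base frame.

    The mean over frames stands in for the learned fusion network; the skip
    connection ensures the base frame's structure anchors the reconstruction,
    which is the intuition behind SBFBurst's guidance mechanisms.
    """
    base = burst[base_idx]
    fused = burst.mean(axis=0)            # stand-in for the residual network
    return base + blend * (fused - base)  # base-frame skip + residual update
```

The design choice being illustrated: because the output is expressed as the base frame plus a correction, a degenerate burst (all frames identical) reproduces the base frame exactly instead of drifting.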
AHA-track: Aggregating hierarchical awareness features for single
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-10 · DOI: 10.1016/j.imavis.2025.105454
Min Yang, Zhiqing Guo, Liejun Wang

Single Object Tracking (SOT) plays a crucial role in various real-world applications but still faces significant challenges, including scale variations and background distractions. While Vision Transformers (ViTs) have demonstrated improvements in tracking performance, they are often hindered by high computational costs. To address these issues, this paper proposes a lightweight single object tracking model that aggregates hierarchical awareness features (AHA-Track). Template information is aggregated by an aggregate token awareness module, and the key points of the template are highlighted to reduce background interference. In addition, the hierarchical deep feature aggregation module gains a more comprehensive understanding of objects at different resolutions, which ultimately helps to improve accuracy and robustness in challenging tracking scenes. AHA-Track enhances both tracking accuracy and speed while maintaining computational efficiency. Extensive experimental evaluations across several benchmark datasets demonstrate that AHA-Track outperforms existing state-of-the-art methods in terms of both tracking accuracy and efficiency. The code and pretrained models are available at https://github.com/YangMinbobo/AHATrack.

Citations: 0
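The aggregate token awareness module is described only at a high level; one common way to "highlight key points of the template" is awareness-score-weighted pooling over template tokens. A numpy sketch under that assumption (the function name and shapes are illustrative):

```python
import numpy as np

def aggregate_tokens(tokens, scores):
    """Softmax-weighted pooling of template tokens.

    tokens: (N, C) template tokens; scores: (N,) learned awareness logits.
    High-scoring (salient) tokens dominate the pooled template feature,
    suppressing background tokens.
    """
    w = np.exp(scores - scores.max())  # stable softmax weights
    w /= w.sum()
    return (w[:, None] * tokens).sum(axis=0)  # (C,) aggregated feature
```

When one token's awareness logit dominates, the pooled feature collapses onto that token, which is the mechanism by which background template positions stop contributing.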
MAFMv3: An automated Multi-Scale Attention-Based Feature Fusion MobileNetv3 for spine lesion classification
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-10 · DOI: 10.1016/j.imavis.2025.105440
Aqsa Dastgir, Wang Bin, Muhammad Usman Saeed, Jinfang Sheng, Salman Saleem

Spine lesion classification is a crucial task in medical imaging that plays a significant role in the early diagnosis and treatment of spinal conditions. In this paper, we propose MAFMv3 (Multi-Scale Attention-Based Feature Fusion MobileNetv3), a model for automated spine lesion classification that builds upon MobileNetv3, incorporating attention and Atrous Spatial Pyramid Pooling (ASPP) modules to enhance focus on lesion regions and capture multi-scale features. This novel architecture uses raw, normalized, and histogram-equalized images to generate a comprehensive 3D feature map, significantly improving classification performance. Preprocessing includes histogram equalization, and data augmentation techniques are applied to expand the dataset and enhance model generalization. The proposed model is evaluated on the publicly available VinDr-SpineXR dataset. MAFMv3 achieves state-of-the-art results with an accuracy of 96.81%, precision of 98.38%, recall of 97.95%, F1-score of 98.15%, and AUC of 99.98%, demonstrating its potential for clinical applications in medical imaging. Future work will focus on further optimizations and on validating the model in real-world clinical environments to enhance its diagnostic impact.

Citations: 0
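Of the three preprocessed inputs MAFMv3 stacks, histogram equalization is the only nontrivial one; the classic CDF-remapping version for 8-bit images is a few lines of numpy:

```python
import numpy as np

def hist_equalize(img):
    """Histogram-equalize a uint8 grayscale image via its cumulative histogram.

    Remaps intensities so the output histogram is approximately flat, boosting
    contrast in low-contrast radiographs before they are fed to the network.
    Assumes the image is not constant-valued.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0).astype(np.uint8)
    return lut[img]  # apply the lookup table per pixel
```

The equalized image, the raw image, and a normalized copy would then be stacked channel-wise to form the multi-view input the abstract describes.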
Resource-aware strategies for real-time multi-person pose estimation
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-07 · DOI: 10.1016/j.imavis.2025.105441
Mohammed A. Esmail, Jinlei Wang, Yihao Wang, Li Sun, Guoliang Zhu, Guohe Zhang

When using deep learning for human posture estimation (HPE), especially on devices with limited resources, accuracy and efficiency must be balanced. Common deep-learning architectures tend to consume large amounts of processing power while yielding low accuracy. This work proposes Efficient YoloPose, a new architecture based on You Only Look Once version 8 (YOLOv8)-Pose, to address these issues. Efficient YoloPose replaces traditional convolution and C2f (a faster implementation of the Cross Stage Partial Bottleneck) with lightweight methods such as Depthwise Convolution, Ghost Convolution, and the C3Ghost module. This approach greatly decreases inference time, parameter count, and computational complexity. To improve pose estimation further, Efficient YoloPose integrates the Squeeze-and-Excitation (SE) attention mechanism into the network, focusing estimation on the significant areas of an image. Experimental results show that the proposed model outperforms current models on the COCO and OCHuman datasets. Compared to YOLOv8-Pose, it lowers the inference time from 1.1 milliseconds (ms) to 0.9 ms, the computational complexity from 9.2 Giga floating-point operations (GFLOPs) to 4.8 GFLOPs, and the parameter count from 3.3 million to 1.3 million, while maintaining an average precision (AP) score of 78.8 on the COCO dataset. The source code for Efficient YoloPose is publicly available at https://github.com/malareeqi/Efficient-YoloPose.

Citations: 0
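The parameter savings reported above come largely from swapping standard convolutions for Ghost convolutions, which generate most output channels with cheap depthwise filters. A back-of-the-envelope weight count (the ratio and kernel sizes follow common GhostNet defaults, not necessarily this paper's exact configuration):

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def ghost_conv_params(c_in, c_out, k, ratio=2, dw_k=3):
    """Ghost convolution: a primary conv produces c_out // ratio intrinsic
    channels, then cheap depthwise convs generate the remaining ghost channels."""
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * k * k             # ordinary convolution
    cheap = intrinsic * (ratio - 1) * dw_k * dw_k  # one depthwise filter per ghost map
    return primary + cheap
```

For a 64-to-128-channel 3x3 layer this cuts 73,728 weights down to 37,440, roughly the 2x shrink the ratio predicts, which is consistent in spirit with the 3.3M-to-1.3M reduction the abstract reports for the whole network.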
A small object detection model for drone images based on multi-attention fusion network
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-04 · DOI: 10.1016/j.imavis.2025.105436
Jie Hu, Ting Pang, Bo Peng, Yongguo Shi, Tianrui Li

Object detection in aerial images is crucial for various applications, including precision agriculture, urban planning, disaster management, and military surveillance, as it enables the automated identification and localization of ground objects from high-altitude images. However, this field faces several significant challenges: (1) the uneven distribution of objects; (2) numerous small objects and complex backgrounds in high-resolution aerial images; and (3) significant variation in object sizes. To address these challenges, this paper proposes a new detection network architecture based on the fusion of multiple attention mechanisms, named MAFDet. MAFDet comprises three main components: the multi-attention focusing sub-network, the multi-scale Swin transformer backbone, and the detection head. The multi-attention focusing sub-network generates attention maps to identify regions with dense small objects for precise detection. The multi-scale Swin transformer embeds an efficient multi-scale attention module into the Swin transformer block to extract better multi-layer features and mitigate background interference, significantly enhancing the model's feature extraction capability. Finally, the detector processes regions with dense small objects and global images separately, then fuses the detection results to produce the final output. Experimental results demonstrate that MAFDet outperforms existing methods on the widely used aerial image datasets VisDrone and UAVDT, improving small-object detection average precision (AP_s) by 1.21% and 1.98%, respectively.

Citations: 0
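The hand-off from the multi-attention focusing sub-network to the detector, finding where small objects cluster and re-detecting that crop, reduces to extracting a bounding box from an attention map. A minimal sketch (the threshold value and function name are illustrative):

```python
import numpy as np

def dense_region_bbox(attn, thresh=0.5):
    """Bounding box of above-threshold responses in an (H, W) attention map.

    Returns (y0, x0, y1, x1) with exclusive upper bounds, i.e. the crowded
    region to re-detect at higher resolution, or None if nothing responds.
    """
    ys, xs = np.where(attn >= thresh)
    if ys.size == 0:
        return None
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1
```

The detector would then run once on this crop and once on the full image, with the two result sets fused (e.g. by non-maximum suppression) into the final output.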
Markerless multi-view 3D human pose estimation: A survey
IF 4.2 · CAS Tier 3, Computer Science
Image and Vision Computing · Pub Date: 2025-02-03 · DOI: 10.1016/j.imavis.2025.105437
Ana Filipa Rodrigues Nogueira, Hélder P. Oliveira, Luís F. Teixeira

3D human pose estimation aims to reconstruct the human skeleton of all the individuals in a scene by detecting several body joints. Accurate and efficient methods are required for several real-world applications, including animation, human-robot interaction, surveillance systems, and sports, among many others. However, several obstacles, such as occlusions, random camera perspectives, and the scarcity of 3D labelled data, have been hampering model performance and limiting deployment in real-world scenarios. The greater availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose.

Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey devoted solely to multi-view approaches has been missing since 2012. The goal of this survey is to fill that gap: to present an overview of methodologies for 3D pose estimation in multi-view settings, to understand the strategies found to address the various challenges, and to identify their limitations. Across the reviewed articles, most methods are fully supervised approaches based on geometric constraints. Nonetheless, most methods suffer from 2D pose mismatches; incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, while working directly with 3D features can avoid the problem entirely, at the expense of higher computational complexity. Models with lower supervision levels were identified as a way to overcome some of the issues related to 3D pose estimation, particularly the scarcity of labelled datasets. Still, no method is yet capable of solving all the challenges associated with reconstructing the 3D pose, and given the existing trade-off between complexity and performance, the best method depends on the application scenario. Further research is therefore required to develop an approach capable of quickly inferring a highly accurate 3D pose at bearable computational cost. To this end, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information, and multi-modal approaches might be interesting strategies to keep in mind when developing a new methodology for this task.

Citations: 0