Computer Vision and Image Understanding: Latest Articles

UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-20 DOI: 10.1016/j.cviu.2024.104246
Haibo Chen, Lei Zhao
{"title":"UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation","authors":"Haibo Chen ,&nbsp;Lei Zhao","doi":"10.1016/j.cviu.2024.104246","DOIUrl":"10.1016/j.cviu.2024.104246","url":null,"abstract":"<div><div>Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel <strong>U</strong>npaired <strong>A</strong>rbitrary <strong>T</strong>ext-guided <strong>S</strong>tyle <strong>T</strong>ransfer (<strong>UATST</strong>) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104246"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
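For readers who want a concrete picture of what a cross-space modulation step can look like, here is a minimal Python/PyTorch sketch (not the authors' implementation): a CLIP-space text embedding predicts per-channel scale and shift parameters that re-style instance-normalized VGG-space content features. The module name, feature dimensions, and the affine formulation are illustrative assumptions.

```python
# A minimal sketch of the cross-space modulation idea: a text embedding from a
# CLIP-like space predicts per-channel scale and shift that modulate content
# features in a VGG-like space. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class CrossSpaceModulation(nn.Module):
    def __init__(self, text_dim=512, feat_channels=512):
        super().__init__()
        # map the text embedding to per-channel affine parameters
        self.to_scale = nn.Linear(text_dim, feat_channels)
        self.to_shift = nn.Linear(text_dim, feat_channels)

    def forward(self, content_feat, text_emb):
        # content_feat: (B, C, H, W) VGG features; text_emb: (B, D) CLIP features
        b, c, _, _ = content_feat.shape
        # instance-normalize the content features, then re-style them
        mean = content_feat.mean(dim=(2, 3), keepdim=True)
        std = content_feat.std(dim=(2, 3), keepdim=True) + 1e-6
        normalized = (content_feat - mean) / std
        scale = self.to_scale(text_emb).view(b, c, 1, 1)
        shift = self.to_shift(text_emb).view(b, c, 1, 1)
        return normalized * (1 + scale) + shift

content = torch.randn(2, 512, 32, 32)   # toy VGG-space features
text = torch.randn(2, 512)              # toy CLIP text embeddings
stylized = CrossSpaceModulation()(content, text)
print(stylized.shape)  # torch.Size([2, 512, 32, 32])
```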
3D Pose Nowcasting: Forecast the future to improve the present
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-20 DOI: 10.1016/j.cviu.2024.104233
Alessandro Simoni, Francesco Marchetti, Guido Borghi, Federico Becattini, Lorenzo Seidenari, Roberto Vezzani, Alberto Del Bimbo
{"title":"3D Pose Nowcasting: Forecast the future to improve the present","authors":"Alessandro Simoni ,&nbsp;Francesco Marchetti ,&nbsp;Guido Borghi ,&nbsp;Federico Becattini ,&nbsp;Lorenzo Seidenari ,&nbsp;Roberto Vezzani ,&nbsp;Alberto Del Bimbo","doi":"10.1016/j.cviu.2024.104233","DOIUrl":"10.1016/j.cviu.2024.104233","url":null,"abstract":"<div><div>Technologies to enable safe and effective collaboration and coexistence between humans and robots have gained significant importance in the last few years. A critical component useful for realizing this collaborative paradigm is the understanding of human and robot 3D poses using non-invasive systems. Therefore, in this paper, we propose a novel vision-based system leveraging depth data to accurately establish the 3D locations of skeleton joints. Specifically, we introduce the concept of Pose Nowcasting, denoting the capability of the proposed system to enhance its current pose estimation accuracy by jointly learning to forecast future poses. The experimental evaluation is conducted on two different datasets, providing accurate and real-time performance and confirming the validity of the proposed method on both the robotic and human scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104233"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142757045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
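The nowcasting idea of forecasting the future to improve the present can be illustrated with a small multi-task sketch: a shared backbone feeds both a current-pose head and a future-pose head, and the forecasting loss acts as an auxiliary training signal. This is only a toy Python/PyTorch approximation under assumed shapes, layer sizes, and loss weighting; the paper's actual architecture operates on depth data and differs.

```python
# Toy multi-task setup: one head regresses the current 3D joints, another
# forecasts future joints, and both losses are optimized jointly so the
# auxiliary forecast refines the present estimate. All sizes are assumptions.
import torch
import torch.nn as nn

class PoseNowcaster(nn.Module):
    def __init__(self, feat_dim=256, num_joints=16, horizon=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.current_head = nn.Linear(feat_dim, num_joints * 3)
        self.future_head = nn.Linear(feat_dim, horizon * num_joints * 3)

    def forward(self, depth_feat):
        h = self.backbone(depth_feat)
        return self.current_head(h), self.future_head(h)

model = PoseNowcaster()
feat = torch.randn(4, 512)              # toy per-frame depth features
gt_now = torch.randn(4, 16 * 3)         # current-pose targets
gt_future = torch.randn(4, 5 * 16 * 3)  # future-pose targets
pred_now, pred_future = model(feat)
loss = nn.functional.mse_loss(pred_now, gt_now) \
     + 0.5 * nn.functional.mse_loss(pred_future, gt_future)  # joint objective
loss.backward()
```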
Multi-Scale Adaptive Skeleton Transformer for action recognition
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-19 DOI: 10.1016/j.cviu.2024.104229
Xiaotian Wang, Kai Chen, Zhifu Zhao, Guangming Shi, Xuemei Xie, Xiang Jiang, Yifan Yang
{"title":"Multi-Scale Adaptive Skeleton Transformer for action recognition","authors":"Xiaotian Wang ,&nbsp;Kai Chen ,&nbsp;Zhifu Zhao ,&nbsp;Guangming Shi ,&nbsp;Xuemei Xie ,&nbsp;Xiang Jiang ,&nbsp;Yifan Yang","doi":"10.1016/j.cviu.2024.104229","DOIUrl":"10.1016/j.cviu.2024.104229","url":null,"abstract":"<div><div>Transformer has demonstrated remarkable performance in various computer vision tasks. However, its potential is not fully explored in skeleton-based action recognition. On one hand, existing methods primarily utilize fixed function or pre-learned matrix to encode position information, while overlooking the sample-specific position information. On the other hand, these approaches focus on single-scale spatial relationships, while neglecting the discriminative fine-grained and coarse-grained spatial features. To address these issues, we propose a Multi-Scale Adaptive Skeleton Transformer (MSAST), including Adaptive Skeleton Position Encoding Module (ASPEM), Multi-Scale Embedding Module (MSEM), and Adaptive Relative Location Module (ARLM). ASPEM decouples spatial–temporal information in the position encoding procedure, which acquires inherent dependencies of skeleton sequences. ASPEM is also designed to be dependent on input tokens, which can learn sample-specific position information. The MSEM employs multi-scale pooling to generate multi-scale tokens that contain multi-grained features. Then, the spatial transformer captures multi-scale relations to address the subtle differences between various actions. Another contribution of this paper is that ARLM is presented to mine suitable location information for better recognition performance. Extensive experiments conducted on three benchmark datasets demonstrate that the proposed model achieves Top-1 accuracy of 94.9%/97.5% on NTU-60 C-Sub/C-View, 88.7%/91.6% on NTU-120 X-Sub/X-Set and 97.4% on NW-UCLA, respectively.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104229"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
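To make the multi-scale token idea more tangible, the sketch below pools per-joint features at several scales and feeds the concatenated fine- and coarse-grained tokens to a standard self-attention layer. The pooling scales, joint count, and feature size are assumptions; the paper's MSEM, ASPEM, and ARLM modules are not reproduced here.

```python
# Rough sketch of multi-scale skeleton tokens: joint features are average-pooled
# at several scales so the attention layer sees both fine and coarse tokens.
import torch
import torch.nn as nn

def multi_scale_tokens(joint_feats, scales=(1, 2, 4)):
    # joint_feats: (B, J, C) per-joint features for one frame
    tokens = []
    for s in scales:
        # pool groups of s neighboring joints into coarser tokens
        pooled = nn.functional.avg_pool1d(joint_feats.transpose(1, 2),
                                          kernel_size=s, stride=s)
        tokens.append(pooled.transpose(1, 2))
    return torch.cat(tokens, dim=1)   # (B, J + J/2 + J/4, C)

feats = torch.randn(2, 24, 64)        # 24 joints, 64-dim features (toy values)
tokens = multi_scale_tokens(feats)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)  # spatial attention over multi-scale tokens
print(tokens.shape, out.shape)
```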
Open-set domain adaptation with visual-language foundation models
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-19 DOI: 10.1016/j.cviu.2024.104230
Qing Yu, Go Irie, Kiyoharu Aizawa
{"title":"Open-set domain adaptation with visual-language foundation models","authors":"Qing Yu ,&nbsp;Go Irie ,&nbsp;Kiyoharu Aizawa","doi":"10.1016/j.cviu.2024.104230","DOIUrl":"10.1016/j.cviu.2024.104230","url":null,"abstract":"<div><div>Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104230"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142722825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
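A hedged sketch of the entropy idea mentioned above: zero-shot class probabilities are obtained from image-text similarities, and samples whose prediction entropy exceeds a threshold are flagged as unknown. The embeddings below are random stand-ins for CLIP features, and the temperature and threshold values are arbitrary assumptions rather than the paper's settings.

```python
# Simplified entropy-based rejection on top of zero-shot prediction: softmax
# over image-text similarity gives a class distribution, and high-entropy
# samples are treated as the "unknown" class.
import torch

def zero_shot_with_unknown(img_feat, txt_feat, tau=0.07, ent_thresh=1.5):
    img_feat = torch.nn.functional.normalize(img_feat, dim=-1)
    txt_feat = torch.nn.functional.normalize(txt_feat, dim=-1)
    probs = (img_feat @ txt_feat.t() / tau).softmax(dim=-1)   # (B, K)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    pred = probs.argmax(dim=-1)
    pred[entropy > ent_thresh] = -1   # -1 marks the unknown class
    return pred, entropy

image_features = torch.randn(8, 512)   # stand-in for CLIP image embeddings
text_features = torch.randn(10, 512)   # stand-in for K=10 known-class prompts
labels, ent = zero_shot_with_unknown(image_features, text_features)
print(labels, ent)
```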
Action assessment in rehabilitation: Leveraging machine learning and vision-based analysis
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-19 DOI: 10.1016/j.cviu.2024.104228
Alaa Kryeem, Noy Boutboul, Itai Bear, Shmuel Raz, Dana Eluz, Dorit Itah, Hagit Hel-Or, Ilan Shimshoni
{"title":"Action assessment in rehabilitation: Leveraging machine learning and vision-based analysis","authors":"Alaa Kryeem ,&nbsp;Noy Boutboul ,&nbsp;Itai Bear ,&nbsp;Shmuel Raz ,&nbsp;Dana Eluz ,&nbsp;Dorit Itah ,&nbsp;Hagit Hel-Or ,&nbsp;Ilan Shimshoni","doi":"10.1016/j.cviu.2024.104228","DOIUrl":"10.1016/j.cviu.2024.104228","url":null,"abstract":"<div><div>Post-hip replacement rehabilitation often depends on exercises under medical supervision. Yet, the lack of therapists, financial limits, and inconsistent evaluations call for a more user-friendly, accessible approach. Our proposed solution is a scalable, affordable system based on computer vision, leveraging machine learning and 2D cameras to provide tailored monitoring. This system is designed to address the shortcomings of conventional rehab methods, facilitating effective healthcare at home. The system’s key feature is the use of DTAN deep learning approach to synchronize exercise data over time, which guarantees precise analysis and evaluation. We also introduce a ‘Golden Feature’—a spatio-temporal element that embodies the essential movement of the exercise, serving as the foundation for aligning signals and identifying crucial exercise intervals. The system employs automated feature extraction and selection, offering valuable insights into the execution of exercises and enhancing the system’s precision. Moreover, it includes a multi-label ML model that not only predicts exercise scores but also forecasts therapists’ feedback for exercises performed partially. Performance of the proposed system is shown to be predict exercise scores with accuracy between 82% and 95%. Due to the automatic feature selection, and alignment methods, the proposed framework is easily scalable to additional exercises.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104228"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142757044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
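As a rough illustration of the pipeline shape described in the abstract (align around a key movement signal, extract fixed-length features, predict scores and feedback with a multi-label model), here is a deliberately simplified Python sketch. The peak-based cropping stands in for the DTAN alignment and the 'Golden Feature', and the random-forest multi-output classifier, window sizes, and label layout are all assumptions.

```python
# Simplified pipeline: locate the main exercise interval from one signal,
# extract fixed-length features, and predict several labels at once.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def crop_around_peak(signal, half_window=25):
    center = int(np.argmax(signal))            # crude stand-in for alignment
    lo = max(0, center - half_window)
    return signal[lo:center + half_window]

rng = np.random.default_rng(0)
signals = [rng.normal(size=200) for _ in range(40)]   # toy joint-angle series
X = np.stack([np.resize(crop_around_peak(s), 50) for s in signals])
y = rng.integers(0, 2, size=(40, 3))   # toy multi-label targets (score bin + 2 feedback flags)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)                          # scikit-learn handles multi-output natively
print(clf.predict(X[:2]))
```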
Leveraging vision-language prompts for real-world image restoration and enhancement
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-16 DOI: 10.1016/j.cviu.2024.104222
Yanyan Wei, Yilin Zhang, Kun Li, Fei Wang, Shengeng Tang, Zhao Zhang
{"title":"Leveraging vision-language prompts for real-world image restoration and enhancement","authors":"Yanyan Wei ,&nbsp;Yilin Zhang ,&nbsp;Kun Li ,&nbsp;Fei Wang ,&nbsp;Shengeng Tang ,&nbsp;Zhao Zhang","doi":"10.1016/j.cviu.2024.104222","DOIUrl":"10.1016/j.cviu.2024.104222","url":null,"abstract":"<div><div>Significant advancements have been made in image restoration methods aimed at removing adverse weather effects. However, due to natural constraints, it is challenging to collect real-world datasets for adverse weather removal tasks. Consequently, existing methods predominantly rely on synthetic datasets, which struggle to generalize to real-world data, thereby limiting their practical utility. While some real-world adverse weather removal datasets have emerged, their design, which involves capturing ground truths at a different moment, inevitably introduces interfering discrepancies between the degraded images and the ground truths. These discrepancies include variations in brightness, color, contrast, and minor misalignments. Meanwhile, real-world datasets typically involve complex rather than singular degradation types. In many samples, degradation features are not overt, which poses immense challenges to real-world adverse weather removal methodologies. To tackle these issues, we introduce the recently prominent vision-language model, CLIP, to aid in the image restoration process. An expanded and fine-tuned CLIP model acts as an ‘expert’, leveraging the image priors acquired through large-scale pre-training to guide the operation of the image restoration model. Additionally, we generate a set of pseudo-ground-truths on sequences of degraded images to further alleviate the difficulty for the model in fitting the data. To imbue the model with more prior knowledge about degradation characteristics, we also incorporate additional synthetic training data. Lastly, the progressive learning and fine-tuning strategies employed during training enhance the model’s final performance, enabling our method to surpass existing approaches in both visual quality and objective image quality assessment metrics.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104222"},"PeriodicalIF":4.3,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
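One common way a CLIP-like 'expert' can guide a restoration network is through a prompt-based similarity loss; the toy sketch below pushes restored-image embeddings toward a clean-weather prompt and away from a degraded-weather prompt. The encoders are random stand-ins and the loss form is an assumption, not the paper's exact objective.

```python
# Toy prompt-guidance loss: the restored image should embed closer to a
# "clean" prompt than to a "degraded" prompt in a shared embedding space.
import torch
import torch.nn.functional as F

def clip_guidance_loss(image_emb, clean_txt_emb, degraded_txt_emb):
    image_emb = F.normalize(image_emb, dim=-1)
    clean = F.normalize(clean_txt_emb, dim=-1)
    degraded = F.normalize(degraded_txt_emb, dim=-1)
    # encourage high similarity to "clean" and low similarity to "degraded"
    return (1 - (image_emb * clean).sum(-1) + (image_emb * degraded).sum(-1)).mean()

restored_emb = torch.randn(4, 512, requires_grad=True)   # stand-in image embeddings
clean_prompt = torch.randn(1, 512)                       # e.g., "a clear photo"
rain_prompt = torch.randn(1, 512)                        # e.g., "a rainy photo"
loss = clip_guidance_loss(restored_emb, clean_prompt, rain_prompt)
loss.backward()
print(loss.item())
```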
RetSeg3D: Retention-based 3D semantic segmentation for autonomous driving
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-15 DOI: 10.1016/j.cviu.2024.104231
Gopi Krishna Erabati, Helder Araujo
{"title":"RetSeg3D: Retention-based 3D semantic segmentation for autonomous driving","authors":"Gopi Krishna Erabati,&nbsp;Helder Araujo","doi":"10.1016/j.cviu.2024.104231","DOIUrl":"10.1016/j.cviu.2024.104231","url":null,"abstract":"<div><div>LiDAR semantic segmentation is one of the crucial tasks for scene understanding in autonomous driving. Recent trends suggest that voxel- or fusion-based methods obtain improved performance. However, the fusion-based methods are computationally expensive. On the other hand, the voxel-based methods uniformly employ local operators (e.g., 3D SparseConv) without considering the varying-density property of LiDAR point clouds, which result in inferior performance, specifically on far away sparse points due to limited receptive field. To tackle this issue, we propose novel retention block to capture long-range dependencies, maintain the receptive field of far away sparse points and design <strong>RetSeg3D</strong>, a retention-based 3D semantic segmentation model for autonomous driving. Instead of vanilla attention mechanism to model long-range dependencies, inspired by RetNet, we design cubic window multi-scale retentive self-attention (CW-MSRetSA) module with bidirectional and 3D explicit decay mechanism to introduce 3D spatial distance related prior information into the model to improve not only the receptive field but also the model capacity. Our novel retention block maintains the receptive field which significantly improve the performance of far away sparse points. We conduct extensive experiments and analysis on three large-scale datasets: SemanticKITTI, nuScenes and Waymo. Our method not only outperforms existing methods on far away sparse points but also on close and medium distance points and efficiently runs in real time at 52.1 FPS on a RTX 4090 GPU.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104231"},"PeriodicalIF":4.3,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
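To give a feel for retention with an explicit spatial decay, the following simplified sketch damps pairwise mixing weights by a factor that decays with the 3D distance between points, keeping a global receptive field while favouring nearby structure. The decay form, normalization, and shapes are assumptions and do not reproduce the CW-MSRetSA module.

```python
# Simplified retention-style mixing with an explicit 3D distance decay:
# attention scores between points are damped according to how far apart
# the points are, biasing mixing toward nearby geometry.
import torch

def retention_with_decay(q, k, v, coords, gamma=0.9):
    # q, k, v: (N, C) point/voxel features; coords: (N, 3) 3D positions
    scores = q @ k.t() / q.shape[-1] ** 0.5            # (N, N)
    dist = torch.cdist(coords, coords)                 # pairwise 3D distances
    decay = gamma ** dist                              # explicit distance decay
    weights = torch.softmax(scores, dim=-1) * decay
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights @ v

n, c = 64, 32
feats = torch.randn(n, c)
xyz = torch.rand(n, 3) * 10.0
out = retention_with_decay(feats, feats, feats, xyz)
print(out.shape)   # torch.Size([64, 32])
```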
SANet: Selective Aggregation Network for unsupervised object re-identification
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-15 DOI: 10.1016/j.cviu.2024.104232
Minghui Lin, Jianhua Tang, Longbin Fu, Zhengrong Zuo
{"title":"SANet: Selective Aggregation Network for unsupervised object re-identification","authors":"Minghui Lin,&nbsp;Jianhua Tang,&nbsp;Longbin Fu,&nbsp;Zhengrong Zuo","doi":"10.1016/j.cviu.2024.104232","DOIUrl":"10.1016/j.cviu.2024.104232","url":null,"abstract":"<div><div>Recent advancements in unsupervised object re-identification have witnessed remarkable progress, which usually focuses on capturing fine-grained semantic information through partitioning or relying on auxiliary networks for optimizing label consistency. However, incorporating extra complex partitioning mechanisms and models leads to non-negligible optimization difficulties, resulting in limited performance gains. To address these problems, this paper presents a Selective Aggregation Network (SANet) to obtain high-quality features and labels for unsupervised object re-identification, which explores primitive fine-grained information of large-scale pre-trained models such as CLIP and designs customized modifications. Specifically, we propose an adaptive selective aggregation module that chooses a set of tokens based on CLIP’s attention scores to aggregate discriminative global features. Built upon the representations output by the adaptive selective aggregation module, we design a dynamic weighted clustering algorithm to obtain accurate confidence-weighted pseudo-class centers for contrastive learning. In addition, a dual confidence judgment strategy is introduced to refine and correct the pseudo-labels by assigning three categories of samples through their noise degree. By this means, the proposed SANet enables discriminative feature extraction and clustering refinement for more precise classification without complex architectures such as feature partitioning or auxiliary models. Extensive experiments on existing standard unsupervised object re-identification benchmarks, including Market1501, MSMT17, and Veri776, demonstrate the effectiveness of the proposed SANet method, and SANet achieves state-of-the-art results over other strong competitors.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104232"},"PeriodicalIF":4.3,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
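The selective aggregation idea can be sketched as keeping only the most-attended patch tokens and averaging them into a global descriptor. In the sketch below the attention scores are random stand-ins for CLIP attention, and the keep ratio and token shapes are illustrative assumptions.

```python
# Minimal selective aggregation: keep the top-scoring patch tokens according
# to attention scores and average them into one global feature per image.
import torch

def selective_aggregate(tokens, attn_scores, keep_ratio=0.3):
    # tokens: (B, N, C) patch tokens; attn_scores: (B, N) CLS-to-patch attention
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = attn_scores.topk(k, dim=1).indices               # most attended patches
    selected = torch.gather(
        tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return selected.mean(dim=1)                            # (B, C) global feature

patch_tokens = torch.randn(2, 196, 512)    # toy ViT patch tokens
scores = torch.rand(2, 196)                # toy attention scores
feat = selective_aggregate(patch_tokens, scores)
print(feat.shape)   # torch.Size([2, 512])
```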
Scene-cGAN: A GAN for underwater restoration and scene depth estimation
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-13 DOI: 10.1016/j.cviu.2024.104225
Salma González-Sabbagh, Antonio Robles-Kelly, Shang Gao
{"title":"Scene-cGAN: A GAN for underwater restoration and scene depth estimation","authors":"Salma González-Sabbagh ,&nbsp;Antonio Robles-Kelly ,&nbsp;Shang Gao","doi":"10.1016/j.cviu.2024.104225","DOIUrl":"10.1016/j.cviu.2024.104225","url":null,"abstract":"<div><div>Despite their wide scope of application, the development of underwater models for image restoration and scene depth estimation is not a straightforward task due to the limited size and quality of underwater datasets, as well as variations in water colours resulting from attenuation, absorption and scattering phenomena in the water column. To address these challenges, we present an all-in-one conditional generative adversarial network (cGAN) called Scene-cGAN. Our cGAN is a physics-based multi-domain model designed for image dewatering, restoration and depth estimation. It comprises three generators and one discriminator. To train our Scene-cGAN, we use a multi-term loss function based on uni-directional cycle-consistency and a novel dataset. This dataset is constructed from RGB-D in-air images using spectral data and concentrations of water constituents obtained from real-world water quality surveys. This approach allows us to produce imagery consistent with the radiance and veiling light corresponding to representative water types. Additionally, we compare Scene-cGAN with current state-of-the-art methods using various datasets. Results demonstrate its competitiveness in terms of colour restoration and its effectiveness in estimating the depth information for complex underwater scenes.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104225"},"PeriodicalIF":4.3,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
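Physics-based synthesis of this kind typically builds on the standard underwater image-formation model: a direct signal attenuated with depth plus a veiling-light (backscatter) term. The sketch below applies that model to an RGB-D pair; the per-channel attenuation coefficients and veiling colour are toy values, not the survey-derived quantities used in the paper.

```python
# Standard underwater image-formation model applied to an in-air RGB-D pair:
# I = J * exp(-beta * d) + B * (1 - exp(-beta * d)), per colour channel.
import numpy as np

def synthesize_underwater(rgb, depth, beta=(0.4, 0.2, 0.1), veil=(0.1, 0.3, 0.4)):
    # rgb: (H, W, 3) in [0, 1]; depth: (H, W) in meters
    beta = np.asarray(beta)          # per-channel attenuation (red absorbs fastest)
    veil = np.asarray(veil)          # per-channel veiling-light colour
    t = np.exp(-beta[None, None, :] * depth[..., None])   # transmission map
    return rgb * t + veil[None, None, :] * (1.0 - t)

rgb = np.random.rand(120, 160, 3)                 # toy in-air image
depth = np.random.uniform(1.0, 8.0, (120, 160))   # toy depth map
underwater = synthesize_underwater(rgb, depth)
print(underwater.shape, underwater.min(), underwater.max())
```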
2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data
IF 4.3 | CAS Tier 3 | Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-12 DOI: 10.1016/j.cviu.2024.104227
Jacopo Burger, Giorgio Blandano, Giuseppe Maurizio Facchi, Raffaella Lanzarotti
{"title":"2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data","authors":"Jacopo Burger,&nbsp;Giorgio Blandano,&nbsp;Giuseppe Maurizio Facchi,&nbsp;Raffaella Lanzarotti","doi":"10.1016/j.cviu.2024.104227","DOIUrl":"10.1016/j.cviu.2024.104227","url":null,"abstract":"<div><div>Facial Landmark Detection (FLD) algorithms play a crucial role in numerous computer vision applications, particularly in tasks such as face recognition, head pose estimation, and facial expression analysis. While FLD on images has long been the focus, the emergence of 3D data has led to a surge of interest in FLD on it due to its potential applications in various fields, including medical research. However, automating FLD in this context presents significant challenges, such as selecting suitable network architectures, refining outputs for precise landmark localization and optimizing computational efficiency. In response, this paper presents a novel approach, the 2-Stage Stratified Graph Convolutional Network (<span>2S-SGCN</span>), which addresses these challenges comprehensively. The first stage aims to detect landmark regions using heatmap regression, which leverages both local and long-range dependencies through a stratified approach. In the second stage, 3D landmarks are precisely determined using a new post-processing technique, namely <span>MSE-over-mesh</span>. <span>2S-SGCN</span> ensures both efficiency and suitability for resource-constrained devices. Experimental results on 3D scans from the public Facescape and Headspace datasets, as well as on point clouds derived from FLAME meshes collected in the DAD-3DHeads dataset, demonstrate that the proposed method achieves state-of-the-art performance across various conditions. Source code is accessible at <span><span>https://github.com/gfacchi-dev/CVIU-2S-SGCN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104227"},"PeriodicalIF":4.3,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
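A toy sketch of the two-stage idea: stage one assigns every mesh vertex a landmark heatmap score, and stage two turns those scores into a single 3D landmark. Here a generic soft weighted average over vertex positions is used as a stand-in for the paper's MSE-over-mesh post-processing, whose exact formulation is not shown here; all shapes and the temperature are assumptions.

```python
# Generic stage-two refinement: convert per-vertex heatmap scores into a 3D
# landmark via a sharpened weighted average of vertex positions.
import torch

def soft_landmark_from_heatmap(vertices, heat, temperature=0.05):
    # vertices: (V, 3) mesh vertex positions; heat: (V,) predicted heatmap values
    weights = torch.softmax(heat / temperature, dim=0)    # sharpen around the peak
    return (weights.unsqueeze(-1) * vertices).sum(dim=0)  # (3,) landmark location

verts = torch.rand(5000, 3)        # toy face-scan vertices
scores = torch.randn(5000)         # toy stage-one heatmap for one landmark
landmark = soft_landmark_from_heatmap(verts, scores)
print(landmark)
```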