Computer Vision and Image Understanding: Latest Articles

Self-supervised network for low-light traffic image enhancement based on deep noise and artifacts removal
IF 4.3, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-22. DOI: 10.1016/j.cviu.2024.104063
Houwang Zhang, Kai-Fu Yang, Yong-Jie Li, Leanne Lai-Hang Chan

Abstract: In intelligent transportation systems (ITS), detecting vehicles and pedestrians in low-light conditions is challenging due to the low contrast between objects and the background. Recently, many works have enhanced low-light images using deep learning-based methods, but these methods require paired images during training, which are impractical to obtain in real-world traffic scenarios. We therefore propose a self-supervised network (SSN) for low-light traffic image enhancement that can be trained without paired images. To avoid amplifying noise and artifacts during enhancement, we first propose a denoising net that reduces the noise and artifacts in the input image; the processed image is then enhanced by the enhancement net. Considering the compression of traffic images, we design an artifacts removal net to improve the quality of the enhanced image. We propose several effective and differentiable losses that make SSN trainable with low-light images only. To better integrate the features extracted at different levels of the network, we also propose an attention module named the multi-head non-local block. In experiments, we evaluated SSN and other low-light image enhancement methods on two low-light traffic image sets: the Berkeley Deep Drive (BDD) dataset and the Hong Kong night-time multi-class vehicle (HK) dataset. The results indicate that SSN significantly improves upon other methods in visual comparison and on several blind image quality metrics. We also compared classical ITS tasks such as vehicle detection on images enhanced by SSN and other methods, which further verifies its effectiveness.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1077314224001449/pdfft?md5=ede1ba1e9b1e2fb3d3bbb81d8d671711&pid=1-s2.0-S1077314224001449-main.pdf

Citations: 0
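
The multi-head non-local block is only named in the abstract. Below is a minimal PyTorch sketch of what a multi-head non-local attention block generally looks like; the class name, head count, 1x1-convolution projections, and residual placement are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultiHeadNonLocalBlock(nn.Module):
    """Illustrative multi-head non-local (self-attention) block over a 2D feature map."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.head_dim = channels // heads
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (b, c, h, w) -> (b * heads, h * w, head_dim)
            return (t.view(b, self.heads, self.head_dim, h * w)
                     .permute(0, 1, 3, 2)
                     .reshape(b * self.heads, h * w, self.head_dim))

        q, k, v = split(self.query(x)), split(self.key(x)), split(self.value(x))
        # Pairwise affinities between all spatial positions, per head.
        attn = torch.softmax(q @ k.transpose(1, 2) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v                                   # (b * heads, h * w, head_dim)
        out = (out.reshape(b, self.heads, h * w, self.head_dim)
                  .permute(0, 1, 3, 2)
                  .reshape(b, c, h, w))
        return x + self.proj(out)                        # residual connection
```
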
ModelNet-O: A large-scale synthetic dataset for occlusion-aware point cloud classification
IF 4.3, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-19. DOI: 10.1016/j.cviu.2024.104060
Zhongbin Fang, Xia Li, Xiangtai Li, Shen Zhao, Mengyuan Liu

Abstract: Recently, 3D point cloud classification has made significant progress with the help of many datasets. However, these datasets do not reflect the incomplete nature of real-world point clouds caused by occlusion, which limits the practical application of current methods. To bridge this gap, we propose ModelNet-O, a large-scale synthetic dataset of 123,041 samples that emulates real-world point clouds with self-occlusion caused by scanning from monocular cameras. ModelNet-O is 10 times larger than existing datasets and offers more challenging cases for evaluating the robustness of existing methods. Our observations on ModelNet-O reveal that well-designed sparse structures can preserve the structural information of point clouds under occlusion, motivating us to propose a robust point cloud processing method that leverages a critical point sampling (CPS) strategy in a multi-level manner. We term our method PointMLS. Through extensive experiments, we demonstrate that PointMLS achieves state-of-the-art results on ModelNet-O and competitive results on regular datasets such as ModelNet40 and ScanObjectNN, and we also demonstrate its robustness and effectiveness. Code is available at https://github.com/fanglaosi/ModelNet-O_PointMLS.

Citations: 0
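
The abstract names a critical point sampling (CPS) strategy without detailing it. The sketch below illustrates one common notion of critical points (the points that determine a max-pooled global feature, as in PointNet); the function name, voting rule, and top-k selection are assumptions, not PointMLS's actual multi-level procedure.

```python
import torch


def critical_point_sampling(points: torch.Tensor, feats: torch.Tensor, k: int) -> torch.Tensor:
    """Hypothetical critical-point sampling step (PointNet-style critical set).

    points: (N, 3) coordinates, feats: (N, C) per-point features.
    Keeps the k points that attain the channel-wise maximum of the feature
    map most often, i.e. the points the max-pooled global feature depends on.
    """
    winners = feats.argmax(dim=0)                                    # (C,) index of the max point per channel
    votes = torch.bincount(winners, minlength=points.shape[0])       # how many channels each point "wins"
    idx = torch.topk(votes.float(), k=min(k, points.shape[0])).indices
    return points[idx]
```
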
Confidence sharing adaptation for out-of-domain human pose and shape estimation
IF 4.3, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-14. DOI: 10.1016/j.cviu.2024.104051
Tianyi Yue, Keyan Ren, Yu Shi, Hu Zhao, Qingyun Bian

Abstract: 3D human pose and shape estimation is often impacted by distribution bias in real-world scenarios due to factors such as bone length, camera parameters, background, and occlusion. To address this issue, we propose the Confidence Sharing Adaptation (CSA) algorithm, which corrects model bias using unlabeled images from the test domain before testing. However, the lack of annotation constraints in the adaptive training process poses a significant challenge and makes it susceptible to model collapse. CSA utilizes a decoupled dual-branch learning framework that provides pseudo-labels and removes noisy samples based on the confidence scores of the inference results. By sharing the most confident prior knowledge between the dual-branch networks, CSA effectively mitigates distribution bias. CSA also adapts remarkably well to severely occluded scenes, thanks to two auxiliary techniques: a self-attentive parametric regressor that ensures robustness to occlusion of local body parts, and a rendered surface texture loss that regulates the relationship between occluded human joint positions. Evaluation results show that CSA successfully adapts to scenarios beyond the training domain and achieves state-of-the-art performance on both occlusion-specific and general benchmarks. Code and pre-trained models are available for research at https://github.com/bodymapper/csa.git.

Citations: 0
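
The abstract describes pseudo-labelling and noise filtering driven by confidence scores shared between two branches. The snippet below is a hedged sketch of one way such a step could look; the function name, thresholding rule, and tensor shapes are assumptions rather than the CSA algorithm itself.

```python
import torch


def confidence_sharing_step(pred_a: torch.Tensor, conf_a: torch.Tensor,
                            pred_b: torch.Tensor, conf_b: torch.Tensor,
                            threshold: float = 0.8):
    """Illustrative pseudo-labelling rule for a dual-branch adaptation step.

    pred_*: (B, D) pose/shape parameter predictions from the two branches,
    conf_*: (B,) confidence scores in [0, 1]. The more confident branch
    supplies the pseudo-label; low-confidence samples are masked out.
    Returns (pseudo_labels, keep_mask).
    """
    use_a = conf_a >= conf_b
    pseudo = torch.where(use_a.unsqueeze(-1), pred_a, pred_b).detach()  # stop-gradient pseudo-labels
    keep = torch.maximum(conf_a, conf_b) >= threshold                   # drop noisy samples
    return pseudo, keep
```
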
Lv-Adapter: Adapting Vision Transformers for Visual Classification with Linear-layers and Vectors
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-07. DOI: 10.1016/j.cviu.2024.104049
Guangyi Xu, Junyong Ye, Xinyuan Liu, Xubin Wen, Youwei Li, Jingjing Wang

Abstract: Large pre-trained models based on Vision Transformers (ViTs) contain nearly billions of parameters, demanding substantial computational resources and storage space, which restricts their transferability across tasks. Recent approaches address this drawback with adapter fine-tuning, but there is still room for improvement in both the number of tunable parameters and the accuracy. To address this challenge, we propose an adapter fine-tuning module called Lv-Adapter, which consists of a linear layer and a vector. This module enables targeted parameter fine-tuning of pre-trained models by learning both the prior knowledge of the pre-training task and the information from the downstream task, so as to adapt to various downstream image and video tasks during transfer learning. Compared to full fine-tuning, Lv-Adapter has several appealing advantages. First, by adding only about 3% extra parameters to ViT, Lv-Adapter achieves accuracy comparable to full fine-tuning and even significantly surpasses it on action recognition benchmarks. Second, Lv-Adapter is a lightweight module that can be plugged into different transformer models thanks to its simplicity. Finally, to validate these claims, extensive experiments were conducted on five image and video datasets, providing evidence for the effectiveness of Lv-Adapter: when only 3.5% of the extra parameters are updated, it achieves relative boosts of about 13% and 24% over the fully fine-tuned model on SSv2 and HMDB51, respectively.

Citations: 0
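
Since Lv-Adapter is described as "a linear layer and a vector" added to a frozen ViT, a literal PyTorch reading of that description might look like the sketch below; the zero initialization, residual form, and insertion point are assumptions, not the published module.

```python
import torch
import torch.nn as nn


class LvAdapter(nn.Module):
    """Minimal sketch of a "linear layer + learnable vector" adapter.

    Intended to be inserted into an otherwise frozen ViT block; the exact
    placement, initialization, and dimensions are illustrative assumptions.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.vector = nn.Parameter(torch.zeros(dim))   # zero-init: adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); only the adapter parameters receive gradients.
        return x + self.vector * self.linear(x)
```

Zero-initializing the vector makes the adapter an identity mapping at the start of fine-tuning, a common trick for keeping the frozen backbone's behavior intact early in training.
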
Skeleton Cluster Tracking for robust multi-view multi-person 3D human pose estimation
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-07. DOI: 10.1016/j.cviu.2024.104059
Zehai Niu, Ke Lu, Jian Xue, Jinbao Wang

Abstract: Multi-view 3D human pose estimation relies on 2D human pose estimation in each view; however, severe occlusion, truncation, and human interaction lead to incorrect 2D pose estimates in some views. The traditional "Matching-Lifting-Tracking" paradigm amplifies an incorrect 2D human pose into an incorrect 3D human pose, which significantly challenges the robustness of multi-view 3D human pose estimation. In this paper, we propose a novel method that tackles the inherent difficulties of the traditional paradigm. The method is rooted in the newly devised "Skeleton Pooling-Clustering-Tracking" (SPCT) paradigm: it first performs 2D human pose estimation for each view, then a symmetric dilated network estimates a skeleton pool. After clustering the skeleton pool, we introduce a tracking method designed explicitly for the SPCT paradigm, which refines and filters the skeleton clusters and thereby enhances the robustness of the multi-person 3D pose estimates. By coupling the skeleton pool with the tracking refinement process, our method obtains high-quality multi-person 3D human pose estimates despite severe occlusions that produce erroneous 2D and 3D estimates. Employing the proposed SPCT paradigm and a computationally efficient network architecture, our method outperforms existing approaches in robustness on the Shelf, 4D Association, and CMU Panoptic datasets, and can be applied in practical scenarios such as markerless motion capture and animation production.

Citations: 0
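
The skeleton-pool clustering step is not specified in the abstract. As a rough illustration of grouping candidate 3D skeletons into per-person clusters, a greedy distance-based scheme could look like the following; the threshold, greedy assignment, and centre update are assumptions, not the SPCT procedure.

```python
import numpy as np


def cluster_skeletons(skeletons: np.ndarray, dist_thresh: float = 0.15):
    """Illustrative greedy clustering over a pool of candidate 3D skeletons.

    skeletons: (M, J, 3) array of J-joint candidates in metres. A candidate
    joins an existing cluster when its mean per-joint distance to the cluster
    centre is below dist_thresh; otherwise it starts a new cluster.
    Returns a list of index lists, one per person hypothesis.
    """
    clusters, centres = [], []
    for i, s in enumerate(skeletons):
        assigned = False
        for c, centre in enumerate(centres):
            if np.linalg.norm(s - centre, axis=-1).mean() < dist_thresh:
                clusters[c].append(i)
                centres[c] = skeletons[clusters[c]].mean(axis=0)   # update the running centre
                assigned = True
                break
        if not assigned:
            clusters.append([i])
            centres.append(s.copy())
    return clusters
```
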
Semantic-driven diffusion for sign language production with gloss-pose latent spaces alignment
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-07. DOI: 10.1016/j.cviu.2024.104050
Sheng Chen, Qingshan Wang, Qi Wang

Abstract: Sign Language Production (SLP) aims to translate spoken language into visual sign language sequences. The most challenging step in SLP is transforming a sequence of sign glosses into the corresponding sign poses (G2P). Existing G2P approaches mainly focus on constructing mappings from sign language glosses to frame-level sign pose representations, neglecting that a gloss is only a weak annotation of the sign pose sequence. To address this problem, this paper proposes the semantic-driven diffusion model with gloss-pose latent spaces alignment (SDD-GPLA) for G2P. G2P is divided into two phases. In the first phase, we design the gloss-pose latent spaces alignment (GPLA) to model sign pose latent representations with gloss dependency. In the second phase, we propose semantic-driven diffusion (SDD), with supervised pose reconstruction guidance, as a mapping between the gloss and sign pose latent features. In addition, we propose a sign pose decoder (Decoder^p) to progressively generate high-resolution sign poses from latent sign pose features and to guide the SDD training process. We evaluated SDD-GPLA on a self-collected dataset of Daily Chinese Sign Language (DCSL) and on the public RWTH-Phoenix-Weather-2014T dataset. Compared with state-of-the-art G2P methods, we obtain at least 22.9% and 2.3% improvements in WER scores on these two datasets, respectively.

Citations: 0
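
To make the idea of diffusing over pose latents conditioned on gloss latents concrete, the snippet below shows a generic conditional-diffusion (noise-prediction) training step; the denoiser signature, noise schedule, and MSE loss are standard DDPM-style assumptions and do not reproduce SDD-GPLA's guidance or alignment losses.

```python
import torch
import torch.nn.functional as F


def diffusion_training_step(denoiser, pose_latent, gloss_latent, alphas_cumprod):
    """Generic conditional-diffusion training step, used only for illustration.

    pose_latent: (B, D) latent sign-pose features, gloss_latent: (B, D_g)
    conditioning features, alphas_cumprod: (T,) cumulative noise schedule.
    denoiser(noisy, t, cond) is assumed to predict the added noise.
    """
    b = pose_latent.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=pose_latent.device)
    noise = torch.randn_like(pose_latent)
    a_bar = alphas_cumprod.to(pose_latent.device)[t].unsqueeze(-1)
    noisy = a_bar.sqrt() * pose_latent + (1 - a_bar).sqrt() * noise
    pred_noise = denoiser(noisy, t, gloss_latent)   # gloss latent drives the denoising
    return F.mse_loss(pred_noise, noise)
```
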
Identity-preserving editing of multiple facial attributes by learning global edit directions and local adjustments
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-01. DOI: 10.1016/j.cviu.2024.104047
Najmeh Mohammadbagheri, Fardin Ayar, Ahmad Nickabadi, Reza Safabakhsh

Abstract: Semantic facial attribute editing using pre-trained Generative Adversarial Networks (GANs) has attracted a great deal of attention and effort from researchers in recent years. Due to the high quality of face images generated by StyleGANs, much work has focused on the StyleGAN latent space and on methods for facial image editing. Although these methods achieve satisfying results when manipulating user-intended attributes, they have not fulfilled the goal of preserving identity, which remains an important challenge. We present ID-Style, a new architecture capable of addressing the problem of identity loss during attribute manipulation. The key components of ID-Style are a Learnable Global Direction (LGD) module, which finds a shared, semi-sparse direction for each attribute, and an Instance-Aware Intensity Predictor (IAIP) network, which fine-tunes the global direction according to the input instance. Furthermore, we introduce two training losses that force the LGD and IAIP to find semi-sparse semantic directions that preserve the identity of the input instance. Despite reducing the size of the network by roughly 95% compared to similar state-of-the-art works, ID-Style outperforms the baselines by 10% and 7% in the identity-preservation metric (FRS) and the average accuracy of manipulation (mACC), respectively.

Citations: 0
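
The abstract's split into a shared global direction per attribute and an instance-aware intensity is easy to picture in code. The toy module below edits a StyleGAN-style latent as w + alpha(w) * direction; the latent dimension, predictor architecture, and attribute-mask interface are assumptions, not ID-Style's actual LGD/IAIP design.

```python
import torch
import torch.nn as nn


class GlobalDirectionEditor(nn.Module):
    """Toy "global direction + per-instance intensity" latent editor."""

    def __init__(self, num_attrs: int, w_dim: int = 512):
        super().__init__()
        # One learnable edit direction per attribute in the W space.
        self.directions = nn.Parameter(torch.randn(num_attrs, w_dim) * 0.01)
        # Small predictor that outputs an edit strength per attribute for each instance.
        self.intensity = nn.Sequential(nn.Linear(w_dim, 128), nn.ReLU(), nn.Linear(128, num_attrs))

    def forward(self, w: torch.Tensor, attr_mask: torch.Tensor) -> torch.Tensor:
        # w: (B, w_dim); attr_mask: (B, num_attrs) selects which attributes to edit.
        alpha = self.intensity(w) * attr_mask          # instance-aware intensities
        return w + alpha @ self.directions             # shared global directions
```
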
Self-supervised monocular depth estimation with self-distillation and dense skip connection
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-06-01. DOI: 10.1016/j.cviu.2024.104048
Xuezhi Xiang, Wei Li, Yao Wang, Abdulmotaleb El Saddik

Abstract: Monocular depth estimation (MDE) is crucial in a wide range of applications, including robotics, autonomous driving, and virtual reality. Self-supervised monocular depth estimation has emerged as a promising MDE approach that does not require hard-to-obtain depth labels during training, and a multi-scale photometric loss is widely used as the self-supervision signal. However, the multi-scale photometric loss is a weak training signal and can disturb good intermediate feature representations. In this paper, we propose a successive depth map self-distillation (SDM-SD) loss, which is combined with a single-scale photometric loss to replace the multi-scale photometric loss. Moreover, considering that multi-stage feature representations are essential for dense prediction tasks such as depth estimation, we also propose a dense skip connection that efficiently fuses the intermediate features of the encoder and fully utilizes them at each stage of the decoder in our encoder-decoder architecture. By applying the successive depth map self-distillation loss and the dense skip connection, our method achieves state-of-the-art performance on the KITTI benchmark and exhibits the best generalization ability on the challenging indoor NYUv2 dataset.

Citations: 0
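
One plausible reading of "successive depth map self-distillation" is that intermediate-scale depth predictions are supervised by the detached full-resolution prediction, so only the final output needs the single-scale photometric loss. The sketch below implements that reading; the upsampling mode, L1 penalty, and equal weighting are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def self_distillation_loss(intermediate_depths, final_depth):
    """Sketch of a depth self-distillation term.

    intermediate_depths: list of (B, 1, h_i, w_i) intermediate-scale predictions,
    final_depth: (B, 1, H, W) full-resolution prediction used as the teacher.
    """
    target = final_depth.detach()                      # stop gradients through the teacher
    loss = torch.zeros((), device=final_depth.device)
    for d in intermediate_depths:
        d_up = F.interpolate(d, size=target.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.l1_loss(d_up, target)
    return loss / max(len(intermediate_depths), 1)
```
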
DHBSR: A deep hybrid representation-based network for blind image super resolution
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-05-28. DOI: 10.1016/j.cviu.2024.104034
Alireza Esmaeilzehi, Farshid Nooshi, Hossein Zaredar, M. Omair Ahmad

Abstract: Image super resolution enhances the spatial resolution of low-quality images and improves their visual quality. Since the image degradation process is unknown in many real-life situations, performing image super resolution in a blind manner is of paramount importance. Deep neural networks provide high performance for blind image super resolution thanks to their end-to-end learning capability between low-resolution images and their ground-truth versions. Generally speaking, deep blind image super resolution networks first estimate the parameters of the image degradation process, such as the blurring kernel, and then use them to super-resolve the low-resolution images. In this paper, we develop a novel deep learning-based scheme for blind image super resolution that leverages hybrid representations. Specifically, we employ deterministic and stochastic representations of the blurring kernel parameters to train a deep blind super resolution network in an effective manner. The results of extensive experiments demonstrate the effectiveness of the various ideas used in the development of the proposed deep blind image super resolution network.

Citations: 0
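
The abstract mentions deterministic and stochastic representations of the blur-kernel parameters without saying how they are combined. The snippet below only illustrates producing both kinds of representation from a predicted mean and log-variance via the reparameterisation trick; the shapes, sample count, and stacking are assumptions, and DHBSR's actual fusion is not reproduced.

```python
import torch


def hybrid_kernel_representation(mu: torch.Tensor, log_var: torch.Tensor, n_samples: int = 4):
    """Combine a deterministic kernel estimate with stochastic samples (illustrative).

    mu, log_var: (B, K) flattened blur-kernel parameters predicted by an estimator.
    Returns a tensor of shape (B, n_samples + 1, K): the mean plus sampled kernels.
    """
    std = (0.5 * log_var).exp()
    reps = [mu]                                            # deterministic branch
    for _ in range(n_samples):
        reps.append(mu + std * torch.randn_like(std))      # stochastic branch (reparameterisation)
    return torch.stack(reps, dim=1)
```
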
POTLoc: Pseudo-label Oriented Transformer for point-supervised temporal Action Localization
IF 4.5, CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding. Pub Date: 2024-05-28. DOI: 10.1016/j.cviu.2024.104044
Elahe Vahdani, Yingli Tian

Abstract: This paper tackles the challenge of point-supervised temporal action detection, in which only a single frame is annotated for each action instance in the training set. Hindered by the sparse nature of the annotated points, most current methods struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn only the most distinctive segments of actions, leading to incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised action Localization that uses only point-level annotations. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model first generates action proposals solely with point-level supervision. These proposals undergo refinement and regression to improve the precision of the estimated action boundaries, which subsequently yields "pseudo-labels" that serve as supplementary supervisory signals. The architecture integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and to model actions of varying duration. The pseudo-labels, which provide information about the coarse locations and boundaries of actions, guide the transformer toward enhanced learning of action dynamics. POTLoc outperforms state-of-the-art point-supervised methods on the THUMOS'14 and ActivityNet-v1.2 datasets.

Citations: 0
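
As a concrete picture of turning a single annotated frame into a segment-level pseudo-label, the toy function below grows an interval around the point while snippet scores stay above a threshold; the threshold and greedy growth rule are assumptions, and POTLoc's actual proposal refinement and boundary regression are more involved.

```python
import numpy as np


def expand_point_to_segment(scores: np.ndarray, point: int, thresh: float = 0.5):
    """Toy pseudo-label expansion for point-supervised action localization.

    scores: (T,) snippet-level class probabilities for one action class,
    point: index of the single annotated frame. Grows left and right while
    the score stays above thresh. Returns (start, end) inclusive indices.
    """
    start = end = point
    while start > 0 and scores[start - 1] >= thresh:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= thresh:
        end += 1
    return start, end
```
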