Title: Tracking Any Point with Frame-Event Fusion Network at High Frame Rate
Authors: Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
DOI: arxiv-2409.11953 (https://doi.org/arxiv-2409.11953), published 2024-09-18
Abstract: Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.

{"title":"Panoptic-Depth Forecasting","authors":"Juana Valeria Hurtado, Riya Mohan, Abhinav Valada","doi":"arxiv-2409.12008","DOIUrl":"https://doi.org/arxiv-2409.12008","url":null,"abstract":"Forecasting the semantics and 3D structure of scenes is essential for robots\u0000to navigate and plan actions safely. Recent methods have explored semantic and\u0000panoptic scene forecasting; however, they do not consider the geometry of the\u0000scene. In this work, we propose the panoptic-depth forecasting task for jointly\u0000predicting the panoptic segmentation and depth maps of unobserved future\u0000frames, from monocular camera images. To facilitate this work, we extend the\u0000popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR\u0000point clouds and leveraging sequential labeled data. We also introduce a\u0000suitable evaluation metric that quantifies both the panoptic quality and depth\u0000estimation accuracy of forecasts in a coherent manner. Furthermore, we present\u0000two baselines and propose the novel PDcast architecture that learns rich\u0000spatio-temporal representations by incorporating a transformer-based encoder, a\u0000forecasting module, and task-specific decoders to predict future panoptic-depth\u0000outputs. Extensive evaluations demonstrate the effectiveness of PDcast across\u0000two datasets and three forecasting tasks, consistently addressing the primary\u0000challenges. We make the code publicly available at\u0000https://pdcast.cs.uni-freiburg.de.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective
Authors: Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang
DOI: arxiv-2409.11937 (https://doi.org/arxiv-2409.11937), published 2024-09-18
Abstract: Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression. This coupling can lead to poor perception of the three-dimensional transformation. These methods also ignore possible overlaps or gaps between teeth in the predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network that decouples the prediction tasks and feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the relative poses between teeth, which can be easily extended to other 3D point cloud tasks. We also propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance in accuracy and speed compared with existing methods.

Title: End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation
Authors: Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper
DOI: arxiv-2409.11819 (https://doi.org/arxiv-2409.11819), published 2024-09-18
Abstract: 6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.

{"title":"InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models","authors":"Yan Zheng, Lemeng Wu","doi":"arxiv-2409.11734","DOIUrl":"https://doi.org/arxiv-2409.11734","url":null,"abstract":"In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for\u0000GEO, an exceptionally versatile image editing technique designed to cater to\u0000customized user requirements at both local and global scales. Our approach\u0000seamlessly integrates text prompts and image prompts to yield diverse and\u0000precise editing outcomes. Notably, our method operates without the need for\u0000training and is driven by two key contributions: (i) a novel geometric\u0000accumulation loss that enhances DDIM inversion to faithfully preserve pixel\u0000space geometry and layout, and (ii) an innovative boosted image prompt\u0000technique that combines pixel-level editing for text-only inversion with latent\u0000space geometry guidance for standard classifier-free reversion. Leveraging the\u0000publicly available Stable Diffusion model, our approach undergoes extensive\u0000evaluation across various image types and challenging prompt editing scenarios,\u0000consistently delivering high-fidelity editing results for real images.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Distillation-free Scaling of Large SSMs for Images and Videos
Authors: Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
DOI: arxiv-2409.11867 (https://doi.org/arxiv-2409.11867), published 2024-09-18
Abstract: State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.

Title: A Chinese Continuous Sign Language Dataset Based on Complex Environments
Authors: Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan
DOI: arxiv-2409.11960 (https://doi.org/arxiv-2409.11960), published 2024-09-18
Abstract: The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the Complex Environment Chinese Sign Language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL dataset, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.

Title: Knowledge Adaptation Network for Few-Shot Class-Incremental Learning
Authors: Ye Wang, Yaxiong Wang, Guoshuai Zhao, Xueming Qian
DOI: arxiv-2409.11770 (https://doi.org/arxiv-2409.11770), published 2024-09-18
Abstract: Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.

Title: JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
DOI: arxiv-2409.12156 (https://doi.org/arxiv-2409.12156), published 2024-09-18
Abstract: We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.

Title: Agglomerative Token Clustering
Authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
DOI: arxiv-2409.11923 (https://doi.org/arxiv-2409.11923), published 2024-09-18
Abstract: We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e., without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
