Title: Tracking Any Point with Frame-Event Fusion Network at High Frame Rate
Authors: Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
DOI: arxiv-2409.11953 (https://doi.org/arxiv-2409.11953), published 2024-09-18
Abstract: Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.

{"title":"Panoptic-Depth Forecasting","authors":"Juana Valeria Hurtado, Riya Mohan, Abhinav Valada","doi":"arxiv-2409.12008","DOIUrl":"https://doi.org/arxiv-2409.12008","url":null,"abstract":"Forecasting the semantics and 3D structure of scenes is essential for robots\u0000to navigate and plan actions safely. Recent methods have explored semantic and\u0000panoptic scene forecasting; however, they do not consider the geometry of the\u0000scene. In this work, we propose the panoptic-depth forecasting task for jointly\u0000predicting the panoptic segmentation and depth maps of unobserved future\u0000frames, from monocular camera images. To facilitate this work, we extend the\u0000popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR\u0000point clouds and leveraging sequential labeled data. We also introduce a\u0000suitable evaluation metric that quantifies both the panoptic quality and depth\u0000estimation accuracy of forecasts in a coherent manner. Furthermore, we present\u0000two baselines and propose the novel PDcast architecture that learns rich\u0000spatio-temporal representations by incorporating a transformer-based encoder, a\u0000forecasting module, and task-specific decoders to predict future panoptic-depth\u0000outputs. Extensive evaluations demonstrate the effectiveness of PDcast across\u0000two datasets and three forecasting tasks, consistently addressing the primary\u0000challenges. We make the code publicly available at\u0000https://pdcast.cs.uni-freiburg.de.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective
Authors: Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang
DOI: arxiv-2409.11937 (https://doi.org/arxiv-2409.11937), published 2024-09-18
Abstract: Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression. This coupling can lead to poor perception of the three-dimensional transformation. These methods also ignore possible overlaps or gaps between teeth in the predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network that decouples the prediction tasks and feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the relative poses between teeth, which can be easily extended to other 3D point cloud tasks. We also propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance in accuracy and speed compared with existing methods.

Title: End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation
Authors: Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper
DOI: arxiv-2409.11819 (https://doi.org/arxiv-2409.11819), published 2024-09-18
Abstract: 6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.

{"title":"InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models","authors":"Yan Zheng, Lemeng Wu","doi":"arxiv-2409.11734","DOIUrl":"https://doi.org/arxiv-2409.11734","url":null,"abstract":"In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for\u0000GEO, an exceptionally versatile image editing technique designed to cater to\u0000customized user requirements at both local and global scales. Our approach\u0000seamlessly integrates text prompts and image prompts to yield diverse and\u0000precise editing outcomes. Notably, our method operates without the need for\u0000training and is driven by two key contributions: (i) a novel geometric\u0000accumulation loss that enhances DDIM inversion to faithfully preserve pixel\u0000space geometry and layout, and (ii) an innovative boosted image prompt\u0000technique that combines pixel-level editing for text-only inversion with latent\u0000space geometry guidance for standard classifier-free reversion. Leveraging the\u0000publicly available Stable Diffusion model, our approach undergoes extensive\u0000evaluation across various image types and challenging prompt editing scenarios,\u0000consistently delivering high-fidelity editing results for real images.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Distillation-free Scaling of Large SSMs for Images and Videos
Authors: Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
DOI: arxiv-2409.11867 (https://doi.org/arxiv-2409.11867), published 2024-09-18
Abstract: State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.

Title: A Chinese Continuous Sign Language Dataset Based on Complex Environments
Authors: Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan
DOI: arxiv-2409.11960 (https://doi.org/arxiv-2409.11960), published 2024-09-18
Abstract: The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the Complex Environment Chinese Sign Language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL dataset, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.

Title: Knowledge Adaptation Network for Few-Shot Class-Incremental Learning
Authors: Ye Wang, Yaxiong Wang, Guoshuai Zhao, Xueming Qian
DOI: arxiv-2409.11770 (https://doi.org/arxiv-2409.11770), published 2024-09-18
Abstract: Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.

Title: JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
DOI: arxiv-2409.12156 (https://doi.org/arxiv-2409.12156), published 2024-09-18
Abstract: We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.

Title: Agglomerative Token Clustering
Authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
DOI: arxiv-2409.11923 (https://doi.org/arxiv-2409.11923), published 2024-09-18
Abstract: We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e., without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
