{"title":"What Makes a Maze Look Like a Maze?","authors":"Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu","doi":"arxiv-2409.08202","DOIUrl":"https://doi.org/arxiv-2409.08202","url":null,"abstract":"A unique aspect of human visual understanding is the ability to flexibly\u0000interpret abstract concepts: acquiring lifted rules explaining what they\u0000symbolize, grounding them across familiar and unfamiliar contexts, and making\u0000predictions or reasoning about them. While off-the-shelf vision-language models\u0000excel at making literal interpretations of images (e.g., recognizing object\u0000categories such as tree branches), they still struggle to make sense of such\u0000visual abstractions (e.g., how an arrangement of tree branches may form the\u0000walls of a maze). To address this challenge, we introduce Deep Schema Grounding\u0000(DSG), a framework that leverages explicit structured representations of visual\u0000abstractions for grounding and reasoning. At the core of DSG are\u0000schemas--dependency graph descriptions of abstract concepts that decompose them\u0000into more primitive-level symbols. DSG uses large language models to extract\u0000schemas, then hierarchically grounds concrete to abstract components of the\u0000schema onto images with vision-language models. The grounded schema is used to\u0000augment visual abstraction understanding. We systematically evaluate DSG and\u0000different methods in reasoning on our new Visual Abstractions Dataset, which\u0000consists of diverse, real-world images of abstract concepts and corresponding\u0000question-answer pairs labeled by humans. We show that DSG significantly\u0000improves the abstract visual reasoning performance of vision-language models,\u0000and is a step toward human-aligned understanding of visual abstractions.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding","authors":"Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu","doi":"arxiv-2409.08251","DOIUrl":"https://doi.org/arxiv-2409.08251","url":null,"abstract":"Panoptic narrative grounding (PNG), whose core target is fine-grained\u0000image-text alignment, requires a panoptic segmentation of referred objects\u0000given a narrative caption. Previous discriminative methods achieve only weak or\u0000coarse-grained alignment by panoptic segmentation pretraining or CLIP model\u0000adaptation. Given the recent progress of text-to-image Diffusion models,\u0000several works have shown their capability to achieve fine-grained image-text\u0000alignment through cross-attention maps and improved general segmentation\u0000performance. However, the direct use of phrase features as static prompts to\u0000apply frozen Diffusion models to the PNG task still suffers from a large task\u0000gap and insufficient vision-language interaction, yielding inferior\u0000performance. Therefore, we propose an Extractive-Injective Phrase Adapter\u0000(EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts\u0000with image features and inject the multimodal cues back, which leverages the\u0000fine-grained image-text alignment capability of Diffusion models more\u0000sufficiently. In addition, we also design a Multi-Level Mutual Aggregation\u0000(MLMA) module to reciprocally fuse multi-level image and phrase features for\u0000segmentation refinement. Extensive experiments on the PNG benchmark show that\u0000our method achieves new state-of-the-art performance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPARK: Self-supervised Personalized Real-time Monocular Face Capture","authors":"Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma","doi":"arxiv-2409.07984","DOIUrl":"https://doi.org/arxiv-2409.07984","url":null,"abstract":"Feedforward monocular face capture methods seek to reconstruct posed faces\u0000from a single image of a person. Current state of the art approaches have the\u0000ability to regress parametric 3D face models in real-time across a wide range\u0000of identities, lighting conditions and poses by leveraging large image datasets\u0000of human faces. These methods however suffer from clear limitations in that the\u0000underlying parametric face model only provides a coarse estimation of the face\u0000shape, thereby limiting their practical applicability in tasks that require\u0000precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this\u0000paper, we propose a method for high-precision 3D face capture taking advantage\u0000of a collection of unconstrained videos of a subject as prior information. Our\u0000proposal builds on a two stage approach. We start with the reconstruction of a\u0000detailed 3D face avatar of the person, capturing both precise geometry and\u0000appearance from a collection of videos. We then use the encoder from a\u0000pre-trained monocular face reconstruction method, substituting its decoder with\u0000our personalized model, and proceed with transfer learning on the video\u0000collection. Using our pre-estimated image formation model, we obtain a more\u0000precise self-supervision objective, enabling improved expression and pose\u0000alignment. This results in a trained encoder capable of efficiently regressing\u0000pose and expression parameters in real-time from previously unseen images,\u0000which combined with our personalized geometry model yields more accurate and\u0000high fidelity mesh inference. Through extensive qualitative and quantitative\u0000evaluation, we showcase the superiority of our final model as compared to\u0000state-of-the-art baselines, and demonstrate its generalization ability to\u0000unseen pose, expression and lighting.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking","authors":"Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin","doi":"arxiv-2409.07904","DOIUrl":"https://doi.org/arxiv-2409.07904","url":null,"abstract":"Multiple object tracking (MOT) involves identifying multiple targets and\u0000assigning them corresponding IDs within a video sequence, where occlusions are\u0000often encountered. Recent methods address occlusions using appearance cues\u0000through online learning techniques to improve adaptivity or offline learning\u0000techniques to utilize temporal information from videos. However, most existing\u0000online learning-based MOT methods are unable to learn from all past tracking\u0000information to improve adaptivity on long-term occlusions while maintaining\u0000real-time tracking speed. On the other hand, temporal information-based offline\u0000learning methods maintain a long-term memory to store past tracking\u0000information, but this approach restricts them to use only local past\u0000information during tracking. To address these challenges, we propose a new MOT\u0000framework called the Feature Adaptive Continual-learning Tracker (FACT), which\u0000enables real-time tracking and feature learning for targets by utilizing all\u0000past tracking information. We demonstrate that the framework can be integrated\u0000with various state-of-the-art feature-based trackers, thereby improving their\u0000tracking ability. Specifically, we develop the feature adaptive\u0000continual-learning (FAC) module, a neural network that can be trained online to\u0000learn features adaptively using all past tracking information during tracking.\u0000Moreover, we also introduce a two-stage association module specifically\u0000designed for the proposed continual learning-based tracking. Extensive\u0000experiment results demonstrate that the proposed method achieves\u0000state-of-the-art online tracking performance on MOT17 and MOT20 benchmarks. The\u0000code will be released upon acceptance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expansive Supervision for Neural Radiance Field","authors":"Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang","doi":"arxiv-2409.08056","DOIUrl":"https://doi.org/arxiv-2409.08056","url":null,"abstract":"Neural Radiance Fields have achieved success in creating powerful 3D media\u0000representations with their exceptional reconstruction capabilities. However,\u0000the computational demands of volume rendering pose significant challenges\u0000during model training. Existing acceleration techniques often involve\u0000redesigning the model architecture, leading to limitations in compatibility\u0000across different frameworks. Furthermore, these methods tend to overlook the\u0000substantial memory costs incurred. In response to these challenges, we\u0000introduce an expansive supervision mechanism that efficiently balances\u0000computational load, rendering quality and flexibility for neural radiance field\u0000training. This mechanism operates by selectively rendering a small but crucial\u0000subset of pixels and expanding their values to estimate the error across the\u0000entire area for each iteration. Compare to conventional supervision, our method\u0000effectively bypasses redundant rendering processes, resulting in notable\u0000reductions in both time and memory consumption. Experimental results\u0000demonstrate that integrating expansive supervision within existing\u0000state-of-the-art acceleration frameworks can achieve 69% memory savings and 42%\u0000time savings, with negligible compromise in visual quality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":"https://doi.org/arxiv-2409.07966","url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\u0000research with attention from both academia and industry. While there are\u0000promising results in this area, recent approaches largely focus on lip-sync and\u0000identity control, neglecting the role of emotions and emotion control in the\u0000generative process. That is mainly due to the lack of emotionally rich facial\u0000animation data and algorithms that can synthesize speech animations with\u0000emotional expressions at the same time. In addition, majority of the models are\u0000deterministic, meaning given the same audio input, they produce the same output\u0000motion. We argue that emotions and non-determinism are crucial to generate\u0000diverse and emotionally-rich facial animations. In this paper, we propose\u0000ProbTalk3D a non-deterministic neural network approach for emotion controllable\u0000speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\u0000an emotionally rich facial animation dataset 3DMEAD. We provide an extensive\u0000comparative analysis of our model against the recent 3D facial animation\u0000synthesis approaches, by evaluating the results objectively, qualitatively, and\u0000with a perceptual user study. We highlight several objective metrics that are\u0000more suitable for evaluating stochastic outputs and use both in-the-wild and\u0000ground truth data for subjective evaluation. To our knowledge, that is the\u0000first non-deterministic 3D facial animation synthesis method incorporating a\u0000rich emotion dataset and emotion control with emotion labels and intensity\u0000levels. Our evaluation demonstrates that the proposed model achieves superior\u0000performance compared to state-of-the-art emotion-controlled, deterministic and\u0000non-deterministic models. We recommend watching the supplementary video for\u0000quality judgement. The entire codebase is publicly available\u0000(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bayesian Self-Training for Semi-Supervised 3D Segmentation","authors":"Ozan Unal, Christos Sakaridis, Luc Van Gool","doi":"arxiv-2409.08102","DOIUrl":"https://doi.org/arxiv-2409.08102","url":null,"abstract":"3D segmentation is a core problem in computer vision and, similarly to many\u0000other dense prediction tasks, it requires large amounts of annotated data for\u0000adequate training. However, densely labeling 3D point clouds to employ\u0000fully-supervised training remains too labor intensive and expensive.\u0000Semi-supervised training provides a more practical alternative, where only a\u0000small set of labeled data is given, accompanied by a larger unlabeled set. This\u0000area thus studies the effective use of unlabeled data to reduce the performance\u0000gap that arises due to the lack of annotations. In this work, inspired by\u0000Bayesian deep learning, we first propose a Bayesian self-training framework for\u0000semi-supervised 3D semantic segmentation. Employing stochastic inference, we\u0000generate an initial set of pseudo-labels and then filter these based on\u0000estimated point-wise uncertainty. By constructing a heuristic $n$-partite\u0000matching algorithm, we extend the method to semi-supervised 3D instance\u0000segmentation, and finally, with the same building blocks, to dense 3D visual\u0000grounding. We demonstrate state-of-the-art results for our semi-supervised\u0000method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on\u0000ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial\u0000improvements in dense 3D visual grounding over supervised-only baselines on\u0000ScanRefer. Our project page is available at ouenal.github.io/bst/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor","authors":"Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia","doi":"arxiv-2409.08277","DOIUrl":"https://doi.org/arxiv-2409.08277","url":null,"abstract":"High frame rate and accurate depth estimation plays an important role in\u0000several tasks crucial to robotics and automotive perception. To date, this can\u0000be achieved through ToF and LiDAR devices for indoor and outdoor applications,\u0000respectively. However, their applicability is limited by low frame rate, energy\u0000consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate\u0000temporal and spatial depth densification achieved by exploiting a high frame\u0000rate RGB sensor coupled with a potentially lower frame rate and sparse active\u0000depth sensor. Our proposal jointly enables lower energy consumption and denser\u0000shape reconstruction, by significantly reducing the streaming requirements on\u0000the depth sensor thanks to its three core stages: i) multi-modal encoding, ii)\u0000iterative multi-modal integration, and iii) depth decoding. We present extended\u0000evidence assessing the effectiveness of DoD on indoor and outdoor video\u0000datasets, covering both environment scanning and automotive perception use\u0000cases.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation","authors":"Junsung Lee, Minsoo Kang, Bohyung Han","doi":"arxiv-2409.08077","DOIUrl":"https://doi.org/arxiv-2409.08077","url":null,"abstract":"We propose a simple but effective training-free approach tailored to\u0000diffusion-based image-to-image translation. Our approach revises the original\u0000noise prediction network of a pretrained diffusion model by introducing a noise\u0000correction term. We formulate the noise correction term as the difference\u0000between two noise predictions; one is computed from the denoising network with\u0000a progressive interpolation of the source and target prompt embeddings, while\u0000the other is the noise prediction with the source prompt embedding. The final\u0000noise prediction network is given by a linear combination of the standard\u0000denoising term and the noise correction term, where the former is designed to\u0000reconstruct must-be-preserved regions while the latter aims to effectively edit\u0000regions of interest relevant to the target prompt. Our approach can be easily\u0000incorporated into existing image-to-image translation methods based on\u0000diffusion models. Extensive experiments verify that the proposed technique\u0000achieves outstanding performance with low latency and consistently improves\u0000existing frameworks when combined with them.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction","authors":"Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang","doi":"arxiv-2409.07972","DOIUrl":"https://doi.org/arxiv-2409.07972","url":null,"abstract":"The task of vision-based 3D occupancy prediction aims to reconstruct 3D\u0000geometry and estimate its semantic classes from 2D color images, where the\u00002D-to-3D view transformation is an indispensable step. Most previous methods\u0000conduct forward projection, such as BEVPooling and VoxelPooling, both of which\u0000map the 2D image features into 3D grids. However, the current grid representing\u0000features within a certain height range usually introduces many confusing\u0000features that belong to other height ranges. To address this challenge, we\u0000present Deep Height Decoupling (DHD), a novel framework that incorporates\u0000explicit height prior to filter out the confusing features. Specifically, DHD\u0000first predicts height maps via explicit supervision. Based on the height\u0000distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to\u0000adaptively decoupled the height map into multiple binary masks. MGHS projects\u0000the 2D image features into multiple subspaces, where each grid contains\u0000features within reasonable height ranges. Finally, a Synergistic Feature\u0000Aggregation (SFA) module is deployed to enhance the feature representation\u0000through channel and spatial affinities, enabling further occupancy refinement.\u0000On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art\u0000performance even with minimal input frames. Code is available at\u0000https://github.com/yanzq95/DHD.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}