arXiv - CS - Computer Vision and Pattern Recognition: Latest Papers

FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07904
Authors: Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin
Abstract: Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues, either through online learning techniques to improve adaptivity or through offline learning techniques to exploit temporal information from videos. However, most existing online learning-based MOT methods cannot learn from all past tracking information to improve adaptivity to long-term occlusions while maintaining real-time tracking speed. Offline learning methods based on temporal information, on the other hand, maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.
Citations: 0
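
The abstract describes appearance features learned online from all past tracking information, followed by a two-stage association module. The sketch below illustrates that idea in a much-simplified form: a running per-track appearance statistic stands in for the neural FAC module, and association runs appearance matching first and IoU matching second. All names, thresholds, and the running-mean substitute are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class Track:
    def __init__(self, tid, feat):
        self.tid = tid
        self.mean_feat = feat / np.linalg.norm(feat)
        self.n = 1  # number of past observations folded into the statistic

    def update(self, feat):
        # Fold in every past observation, not just a short recent window.
        feat = feat / np.linalg.norm(feat)
        self.mean_feat = (self.n * self.mean_feat + feat) / (self.n + 1)
        self.mean_feat /= np.linalg.norm(self.mean_feat)
        self.n += 1

def two_stage_associate(tracks, det_feats, iou, app_sim_thresh=0.6, iou_thresh=0.3):
    """Stage 1: appearance matching; stage 2: IoU matching for the leftovers."""
    det_feats = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    track_feats = np.stack([t.mean_feat for t in tracks])
    cost = 1.0 - track_feats @ det_feats.T               # cosine distance
    r, c = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(r, c) if cost[i, j] < 1.0 - app_sim_thresh]
    left_t = [i for i in range(len(tracks)) if i not in {m[0] for m in matches}]
    left_d = [j for j in range(det_feats.shape[0]) if j not in {m[1] for m in matches}]
    if left_t and left_d:
        r2, c2 = linear_sum_assignment(1.0 - iou[np.ix_(left_t, left_d)])
        matches += [(left_t[i], left_d[j]) for i, j in zip(r2, c2)
                    if iou[left_t[i], left_d[j]] > iou_thresh]
    return matches
```
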
Expansive Supervision for Neural Radiance Field
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08056
Authors: Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang
Abstract: Neural Radiance Fields have achieved success in creating powerful 3D media representations with their exceptional reconstruction capabilities. However, the computational demands of volume rendering pose significant challenges during model training. Existing acceleration techniques often involve redesigning the model architecture, leading to limitations in compatibility across different frameworks. Furthermore, these methods tend to overlook the substantial memory costs incurred. In response to these challenges, we introduce an expansive supervision mechanism that efficiently balances computational load, rendering quality, and flexibility for neural radiance field training. This mechanism operates by selectively rendering a small but crucial subset of pixels and expanding their values to estimate the error across the entire area at each iteration. Compared to conventional supervision, our method effectively bypasses redundant rendering processes, resulting in notable reductions in both time and memory consumption. Experimental results demonstrate that integrating expansive supervision within existing state-of-the-art acceleration frameworks achieves 69% memory savings and 42% time savings, with negligible compromise in visual quality.
Citations: 0
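
As a rough illustration of the mechanism the abstract outlines (render only a small but crucial subset of pixels, then expand their errors to estimate the loss over the whole image), here is a minimal PyTorch sketch. Random anchor selection and nearest-anchor expansion are assumptions made for the example; the paper's selection and expansion strategies are not specified in the abstract.

```python
import torch

def expansive_supervision_step(render_fn, rays, gt_rgb, h, w, frac=0.05):
    """rays, gt_rgb: flattened per-pixel rays and ground-truth colors (n = h * w)."""
    n = h * w
    k = max(1, int(frac * n))
    anchors = torch.randperm(n, device=gt_rgb.device)[:k]      # small pixel subset
    pred = render_fn(rays[anchors])                             # render anchors only
    anchor_err = (pred - gt_rgb[anchors]).abs().mean(dim=-1)    # per-anchor error
    # Expand anchor errors to every pixel via nearest-anchor lookup.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(n, 2).float().to(gt_rgb.device)
    nearest = torch.cdist(coords, coords[anchors]).argmin(dim=1)
    full_err_map = anchor_err.detach()[nearest].reshape(h, w)   # estimated error everywhere
    return anchor_err.mean(), full_err_map                      # training loss + error estimate
```
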
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07966
Authors: Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
Abstract: Audio-driven 3D facial animation synthesis has been an active field of research, attracting attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic, meaning that given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable, speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method that incorporates a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
Citations: 0
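
The method is built on a two-stage VQ-VAE. The snippet below sketches only the generic vector-quantization bottleneck such a model rests on (nearest-codebook lookup, commitment loss, straight-through gradients); the codebook size, latent dimension, and loss weight are standard textbook choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                              # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        # Nearest codebook entry for every latent vector.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)
        z_q = self.codebook(idx).view_as(z)
        # Commitment + codebook losses; straight-through gradient for z_q.
        loss = self.beta * ((z_q.detach() - z) ** 2).mean() + ((z_q - z.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1]), loss
```
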
Bayesian Self-Training for Semi-Supervised 3D Segmentation
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08102
Authors: Ozan Unal, Christos Sakaridis, Luc Van Gool
Abstract: 3D segmentation is a core problem in computer vision and, like many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds for fully supervised training remains too labor-intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises from the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation and, finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
Citations: 0
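
A minimal sketch of the pseudo-labeling step the abstract describes: stochastic inference yields per-point class probabilities, pseudo-labels come from the mean prediction, and points are filtered by an estimated point-wise uncertainty. Monte Carlo dropout, the entropy measure, and the threshold are assumptions standing in for whatever stochastic inference and filtering rule the paper actually uses.

```python
import torch

@torch.no_grad()
def bayesian_pseudo_labels(model, points, T=10, max_entropy=0.5):
    model.train()                        # keep dropout active for Monte Carlo sampling
    probs = torch.stack([model(points).softmax(dim=-1) for _ in range(T)])
    mean_p = probs.mean(dim=0)           # (num_points, num_classes) predictive mean
    entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(dim=-1)
    labels = mean_p.argmax(dim=-1)
    keep = entropy < max_entropy         # point-wise uncertainty filter
    return labels[keep], keep            # retained pseudo-labels and their mask
```
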
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08277
Authors: Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia
Abstract: High frame rate and accurate depth estimation play an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction by significantly reducing the streaming requirements on the depth sensor, thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extensive evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
Citations: 0
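
Purely as a structural sketch, the loop below mirrors the three core stages named in the abstract (multi-modal encoding, iterative multi-modal integration, depth decoding) and the asymmetry between a high-frame-rate RGB stream and a lower-frame-rate sparse depth stream. The module interfaces, the reuse of the latest sparse scan, and the fixed iteration count are all placeholders, not the paper's architecture.

```python
import torch

def depth_on_demand(rgb_frames, sparse_depths, rgb_encoder, fuse, decoder, iters=3):
    """rgb_frames arrive at high frame rate; sparse_depths at a lower rate
    (None when no new active-sensor measurement is available)."""
    last_sparse = None
    outputs = []
    for rgb, sparse in zip(rgb_frames, sparse_depths):
        if sparse is not None:
            last_sparse = sparse                  # reuse the latest sparse scan
        feat = rgb_encoder(rgb)                   # i) multi-modal encoding
        hidden = torch.zeros_like(feat)
        for _ in range(iters):                    # ii) iterative multi-modal integration
            hidden = fuse(hidden, feat, last_sparse)
        outputs.append(decoder(hidden))           # iii) dense depth decoding
    return outputs
```
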
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08077
Authors: Junsung Lee, Minsoo Kang, Bohyung Han
Abstract: We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions: one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
Citations: 0
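
The abstract gives the method almost in closed form: a noise correction term equal to the difference between the prediction under a progressively interpolated prompt embedding and the prediction under the source prompt, combined linearly with the standard denoising term. A minimal sketch follows; treating the source-conditioned prediction as the standard term, the linear interpolation schedule for alpha, and the combination weight lam are assumptions about details the abstract leaves open.

```python
def corrected_noise_prediction(eps_model, x_t, t, src_emb, tgt_emb, alpha, lam=1.0):
    """alpha in [0, 1]: progressive interpolation weight along the sampling trajectory."""
    interp_emb = (1.0 - alpha) * src_emb + alpha * tgt_emb
    eps_src = eps_model(x_t, t, src_emb)          # standard denoising term (preserves source regions)
    eps_interp = eps_model(x_t, t, interp_emb)    # prediction under the interpolated prompt
    correction = eps_interp - eps_src             # noise correction term
    return eps_src + lam * correction             # linear combination of the two terms
```
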
Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07972
Authors: Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang
Abstract: The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
Citations: 0
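
A hedged sketch of Mask Guided Height Sampling as the abstract describes it: a predicted height map is decoupled into binary masks over height intervals, and each mask gates the 2D image features projected into its own subspace. The bin edges below are illustrative, not the paper's height distribution statistics.

```python
import torch

def mask_guided_height_sampling(img_feat, height_map, bin_edges=(-2.0, 0.0, 2.0, 4.0)):
    """img_feat: (B, C, H, W) image features; height_map: (B, 1, H, W) predicted heights in metres."""
    subspaces = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = ((height_map >= lo) & (height_map < hi)).float()  # binary mask for one height interval
        subspaces.append(img_feat * mask)    # features restricted to that height range
    return subspaces                         # one decoupled feature map per interval
```
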
SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08083
Authors: Chenyang Lei, Liyi Chen, Jun Cen, Xiao Chen, Zhen Lei, Felix Heide, Ziwei Liu, Qifeng Chen, Zhaoxiang Zhang
Abstract: Foundation models like ChatGPT and Sora, trained on data at a huge scale, have made a revolutionary social impact. However, for sensors in many different fields, it is extremely challenging to collect natural images at similar scales to train strong foundation models. To this end, this work presents SimMAT, a simple and effective framework for studying an open problem: the transferability of vision foundation models trained on natural RGB images to other image modalities with different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model, the Segment Anything Model (SAM), to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models to enhance other sensors' performance. Specifically, SimMAT improves segmentation performance (mIoU) from 22.15% to 53.88% on average across the evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields in achieving better results with vision foundation models.
Citations: 0
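
At its simplest, the described setup is a trainable modality-agnostic transfer layer placed in front of a pretrained vision foundation model such as SAM. The sketch below uses a 1x1 convolution as the transfer layer and keeps the backbone frozen; both choices are assumptions for illustration rather than SimMAT's actual design space.

```python
import torch
import torch.nn as nn

class ModalityAgnosticTransfer(nn.Module):
    def __init__(self, in_channels, foundation_model, out_channels=3):
        super().__init__()
        self.mat = nn.Conv2d(in_channels, out_channels, kernel_size=1)  # trainable transfer layer
        self.foundation = foundation_model
        for p in self.foundation.parameters():    # keep the pretrained backbone frozen (assumption)
            p.requires_grad = False

    def forward(self, x):                         # x: (B, in_channels, H, W), e.g. polarization data
        return self.foundation(self.mat(x))       # map modality into the backbone's input space
```
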
Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07989
Authors: Fatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
Abstract: In few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and the support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach that utilizes a multi-output embedding network to map samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism refines the features at each stage, leading to more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improves performance. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed a cross-domain task from MiniImageNet to the CUB dataset, achieving high accuracy in the testing domain. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. Code: https://github.com/FatemehAskari/MSENet
Citations: 0
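
A hedged sketch of the scoring rule the abstract suggests: features from several backbone stages are refined with self-attention, compared with a per-stage distance, and the per-stage distances are fused with learnable weights. The number of stages, feature dimensions, token pooling, and cosine distance are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStageScorer(nn.Module):
    def __init__(self, dims=(64, 160, 320)):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d, num_heads=4, batch_first=True) for d in dims)
        self.stage_weights = nn.Parameter(torch.zeros(len(dims)))   # learnable per-stage weights

    def forward(self, support_feats, query_feats):
        """Each element of support_feats / query_feats: (N, tokens, dim) for one backbone stage."""
        dists = []
        for attn, s, q in zip(self.attn, support_feats, query_feats):
            s, _ = attn(s, s, s)                        # self-attention refinement
            q, _ = attn(q, q, q)
            s, q = s.mean(dim=1), q.mean(dim=1)         # pool tokens to one vector per sample
            dists.append(1 - F.cosine_similarity(q.unsqueeze(1), s.unsqueeze(0), dim=-1))
        w = self.stage_weights.softmax(dim=0)
        return sum(wi * d for wi, d in zip(w, dists))   # (num_query, num_support) fused distance
```
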
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.08248
Authors: NaHyeon Park, Kunhee Kim, Hyunjung Shim
Abstract: Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
Citations: 0
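
Two of the named ingredients lend themselves to a short sketch: selective fine-tuning that updates only the text encoder while the diffusion U-Net stays frozen, and a knowledge-preservation loss that penalizes drift of the tuned encoder from a frozen copy on generic prompts. The encoder interface, the MSE form of the loss, and its weight are assumptions; the abstract does not specify them.

```python
import copy
import torch
import torch.nn.functional as F

def setup_textboost(text_encoder, unet):
    """Freeze the U-Net, keep a frozen reference copy of the encoder, tune only the encoder."""
    frozen_ref = copy.deepcopy(text_encoder).eval()
    for p in frozen_ref.parameters():
        p.requires_grad = False
    for p in unet.parameters():               # diffusion backbone stays frozen
        p.requires_grad = False
    for p in text_encoder.parameters():       # only the text encoder is fine-tuned
        p.requires_grad = True
    return frozen_ref

def knowledge_preservation_loss(text_encoder, frozen_ref, generic_token_ids, weight=0.1):
    """Keep embeddings of generic prompts close to the pretrained encoder to limit language drift."""
    tuned = text_encoder(generic_token_ids)
    with torch.no_grad():
        ref = frozen_ref(generic_token_ids)
    return weight * F.mse_loss(tuned, ref)
```
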