Computer Vision and Image Understanding — Latest Articles

Seam estimation based on dense matching for parallax-tolerant image stitching
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-11-04 · DOI: 10.1016/j.cviu.2024.104219 · Volume 250, Article 104219
Authors: Zhihao Zhang, Jie He, Mouquan Shen, Xianqiang Yang
Abstract: Image stitching with large parallax poses a significant challenge in computer vision. Existing seam-based approaches attempt to address parallax artifacts by stitching images along seams; however, issues such as object mismatches, disappearances, and duplications still arise, primarily due to inaccurate alignment of dense pixels or inappropriate seam estimation. In this paper, we propose a robust seam-based parallax-tolerant image stitching method that leverages dense flow estimation from state-of-the-art approaches. First, we develop a seam estimation method that does not require pre-estimation of an image warping model; instead, it estimates the seam directly by measuring the local smoothness of the optical flow field and incorporating a penalty term for duplications. We then design an iterative algorithm that uses the location of the estimated seam to solve a spatially smooth warping model and eliminate outlier correspondence pairs. This approach effectively addresses the intertwined problems of estimating the warping model and the seam. Experiments on real-world images show that the proposed method achieves superior local alignment accuracy near the stitching seam and outperforms other state-of-the-art techniques in visual stitching quality. Code is available at https://github.com/zhihao0512/dense-matching-image-stitching.
Citations: 0
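The seam-from-flow-smoothness idea can be illustrated without the paper's full pipeline. Below is a minimal sketch, not the authors' implementation: it builds a per-pixel cost from the local variation of a dense flow field and extracts a vertical seam by dynamic programming. The function names and the random stand-in flow are illustrative; the paper's actual energy also includes the duplication penalty and is solved jointly with the warp.

```python
import numpy as np

def flow_smoothness_cost(flow):
    """Per-pixel cost from local variation of a dense flow field.

    flow: (H, W, 2) optical-flow vectors over the overlap region.
    The cost is high where the flow changes quickly, i.e. where a seam
    would cut through misaligned content.
    """
    dy = np.linalg.norm(np.diff(flow, axis=0, append=flow[-1:]), axis=-1)
    dx = np.linalg.norm(np.diff(flow, axis=1, append=flow[:, -1:]), axis=-1)
    return dx + dy

def vertical_seam(cost):
    """Minimum-cost top-to-bottom seam via dynamic programming."""
    H, W = cost.shape
    acc = cost.copy()
    for i in range(1, H):
        left = np.roll(acc[i - 1], 1)
        left[0] = np.inf
        right = np.roll(acc[i - 1], -1)
        right[-1] = np.inf
        acc[i] += np.minimum(np.minimum(left, acc[i - 1]), right)
    seam = np.empty(H, dtype=int)
    seam[-1] = int(np.argmin(acc[-1]))
    for i in range(H - 2, -1, -1):           # backtrack within a 3-pixel window
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, W)
        seam[i] = lo + int(np.argmin(acc[i, lo:hi]))
    return seam

flow = np.random.randn(64, 48, 2).astype(np.float32)  # stand-in for a dense matcher's output
print(vertical_seam(flow_smoothness_cost(flow))[:8])
```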
Monocular depth estimation with boundary attention mechanism and Shifted Window Adaptive Bins
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-11-04 · DOI: 10.1016/j.cviu.2024.104220 · Volume 249, Article 104220
Authors: Hengjia Hu, Mengnan Liang, Congcong Wang, Meng Zhao, Fan Shi, Chao Zhang, Yilin Han
Abstract: Monocular depth estimation is a classic research topic in computer vision. In recent years, the development of Convolutional Neural Networks (CNNs) has enabled significant breakthroughs in this field. However, two challenges remain: (1) networks struggle to fuse edge features effectively in the feature-fusion stage, which leads to loss of structure or boundary distortion of objects in the scene; and (2) classification-based methods typically depend on Transformers for global modeling, which often introduces substantial computational overhead. In this paper, we propose two modules to address these issues. The first is the Boundary Attention Module (BAM), which leverages an attention mechanism to strengthen the network's perception of object boundaries during feature fusion. In addition, to mitigate the computational cost of predicting adaptive bins, we propose a Shifted Window Adaptive Bins (SWAB) module that reduces the amount of computation in global modeling. The proposed method is evaluated on three public datasets, NYU Depth V2, KITTI, and SUN RGB-D, and demonstrates state-of-the-art (SOTA) performance.
Citations: 0
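For context on the classification-style (adaptive-bins) formulation the abstract builds on: depth is recovered as a probability-weighted sum of per-image bin centers. This is a minimal sketch assuming an AdaBins-like head; `depth_from_bins` and the tensor shapes are hypothetical, and the paper's SWAB module (shifted-window computation of the bins) is not reproduced here.

```python
import torch

def depth_from_bins(bin_width_logits, bin_probs, d_min=1e-3, d_max=10.0):
    """Classification-style depth: softmax-normalized adaptive bin widths
    give per-image bin centers; depth is the probability-weighted sum.

    bin_width_logits: (B, K)       un-normalized widths per image
    bin_probs:        (B, K, H, W) per-pixel probability over the K bins
    """
    widths = torch.softmax(bin_width_logits, dim=1) * (d_max - d_min)
    edges = d_min + torch.cumsum(widths, dim=1)
    centers = edges - 0.5 * widths                       # (B, K)
    return torch.einsum('bk,bkhw->bhw', centers, bin_probs)

B, K, H, W = 2, 64, 32, 32
logits = torch.randn(B, K)
probs = torch.softmax(torch.randn(B, K, H, W), dim=1)
print(depth_from_bins(logits, probs).shape)  # torch.Size([2, 32, 32])
```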
Multivariate prototype representation for domain-generalized incremental learning
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-30 · DOI: 10.1016/j.cviu.2024.104215 · Volume 249, Article 104215
Authors: Can Peng, Piotr Koniusz, Kaiyu Guo, Brian C. Lovell, Peyman Moghadam
Abstract: Deep learning models often suffer from catastrophic forgetting when fine-tuned with samples of new classes, and the problem becomes even more challenging when there is a domain shift between training and testing data. In this paper, we address the critical yet under-explored Domain-Generalized Class-Incremental Learning (DGCIL) task. We propose a DGCIL approach designed to memorize old classes, adapt to new classes, and reliably classify objects from unseen domains. Specifically, our loss formulation maintains classification boundaries while suppressing domain-specific information for each class. Without storing old exemplars, we employ knowledge distillation and estimate the drift of old-class prototypes as incremental training progresses. Our prototype representations are based on multivariate normal distributions, with means and covariances continually adapted to reflect evolving model features, providing effective representations for old classes. We then sample pseudo-features for these old classes from the adapted normal distributions using Cholesky decomposition. Unlike previous pseudo-feature sampling strategies that rely solely on average mean prototypes, our method captures richer semantic variation. Experiments on several benchmarks demonstrate the superior performance of our method compared to the state of the art.
Citations: 0
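The Cholesky-based pseudo-feature sampling described above has a compact generic form: draw z ~ N(0, I) and map it through the class mean and covariance via x = mean + Lz with LLᵀ = cov. A minimal sketch under that reading (function name and dimensions are illustrative; the paper's continual adaptation of the statistics during incremental training is omitted):

```python
import torch

def sample_pseudo_features(mean, cov, n, eps=1e-4):
    """Draw n pseudo-features for an old class from N(mean, cov)
    using a Cholesky factor of the (jittered) covariance.

    mean: (D,)   stored class prototype
    cov:  (D, D) stored class covariance
    """
    d = mean.numel()
    L = torch.linalg.cholesky(cov + eps * torch.eye(d))  # jitter keeps cov positive-definite
    z = torch.randn(n, d)
    return mean + z @ L.T

D = 16
feats = torch.randn(100, D)                  # stand-in for features of one old class
mean, cov = feats.mean(0), torch.cov(feats.T)
print(sample_pseudo_features(mean, cov, n=8).shape)  # torch.Size([8, 16])
```

Sampling from a full covariance, rather than jittering around the mean prototype alone, is what lets the replayed features carry the class's semantic variation.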
Diffusion Models for Counterfactual Explanations
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-29 · DOI: 10.1016/j.cviu.2024.104207 · Volume 249, Article 104207
Authors: Guillaume Jeanneret, Loïc Simon, Frédéric Jurie
Abstract: Counterfactual explanations have shown promise as a post-hoc framework for improving the explanatory power of image classifiers. This paper proposes DiME, a method for generating counterfactual images using recent diffusion models. The proposed method uses a guided generative diffusion process that exploits the gradients of the target classifier to generate counterfactual explanations of the input instances. Furthermore, we examine existing strategies for assessing spurious correlations and extend them with a novel measure, Correlation Difference, which is more efficient at detecting such correlations. The work includes a comprehensive ablation study and a thorough experimental validation demonstrating that the proposed algorithm outperforms previous state-of-the-art results on the CelebA, CelebA-HQ, and BDD100k datasets.
Citations: 0
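For orientation, classifier-guided diffusion — the general mechanism such counterfactual methods build on — shifts the predicted noise with the gradient of a target-class log-probability at each denoising step. The sketch below is one generic guided step with toy stand-in networks, not DiME itself; `guided_step`, the guidance scale, and the placeholder modules are assumptions.

```python
import torch
import torch.nn.functional as F

def guided_step(x_t, t, eps_model, classifier, target, alpha_bar_t, scale=2.0):
    """One classifier-guided denoising step: nudge the predicted noise with
    the gradient of the target-class log-probability, then form the usual
    x0 prediction from the guided noise estimate."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = F.log_softmax(classifier(x_in), dim=-1)
        selected = log_probs[torch.arange(x_t.shape[0]), target].sum()
        grad = torch.autograd.grad(selected, x_in)[0]
    # epsilon-space guidance; a full sampler would turn this into x_{t-1}
    eps = eps_model(x_t, t) - scale * (1.0 - alpha_bar_t).sqrt() * grad
    return (x_t - (1.0 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()

eps_model = lambda x, t: torch.zeros_like(x)           # toy stand-in denoiser
classifier = torch.nn.Sequential(torch.nn.Flatten(),
                                 torch.nn.Linear(3 * 8 * 8, 10))
x_t = torch.randn(4, 3, 8, 8)
out = guided_step(x_t, 0, eps_model, classifier,
                  target=torch.tensor([1, 2, 3, 4]),
                  alpha_bar_t=torch.tensor(0.5))
print(out.shape)  # torch.Size([4, 3, 8, 8])
```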
3D scene generation for zero-shot learning using ChatGPT guided language prompts
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-29 · DOI: 10.1016/j.cviu.2024.104211 · Volume 249, Article 104211
Authors: Sahar Ahmadi, Ali Cheraghian, Townim Faisal Chowdhury, Morteza Saberi, Shafin Rahman
Abstract: Zero-shot learning on 3D point cloud data remains relatively unexplored compared to its 2D image counterpart, and it introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision, enhancing the network's ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts, resembling scene annotations, generated from one or two point cloud objects. Since basic prompts have limited diversity, we employ ChatGPT to expand them, enriching the contextual information in the descriptions. Leveraging these descriptions, we arrange the coordinates of point cloud objects to fabricate augmented 3D scenes. Finally, employing contrastive learning, we train our proposed architecture end-to-end on pairs of 3D scenes and prompt-based captions. We posit that 3D scenes convey object relationships more effectively than individual objects do, as demonstrated by the effectiveness of language models such as BERT in contextual understanding. Our prompt-guided scene generation method combines data augmentation with prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanObjectNN) 3D object datasets. Furthermore, we challenge the model by training on synthetic data and testing on real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Codes and models are available at: https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D.
Citations: 0
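The scene-fabrication step — placing point-cloud objects at coordinates suggested by a (ChatGPT-expanded) description — reduces to translating normalized objects into a shared frame. A minimal sketch with hard-coded placements standing in for the prompt-derived layout; the function name and object sizes are illustrative:

```python
import numpy as np

def compose_scene(objects, placements):
    """Assemble an augmented 3D scene by translating zero-centered
    point-cloud objects to target centers taken from a layout, e.g. one
    parsed from a caption like 'a chair next to a table'.

    objects:    list of (N_i, 3) arrays, each zero-centered
    placements: list of (3,) target centers
    """
    return np.concatenate([pts + np.asarray(c) for pts, c in zip(objects, placements)])

chair = np.random.randn(1024, 3) * 0.2   # stand-in normalized objects
table = np.random.randn(1024, 3) * 0.3
scene = compose_scene([chair, table], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
print(scene.shape)  # (2048, 3)
```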
A large corpus for the recognition of Greek Sign Language gestures
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-29 · DOI: 10.1016/j.cviu.2024.104212 · Volume 249, Article 104212
Authors: Katerina Papadimitriou, Galini Sapountzaki, Kyriaki Vasilaki, Eleni Efthimiou, Stavroula-Evita Fotinea, Gerasimos Potamianos
Abstract: Sign language recognition (SLR) from videos constitutes a captivating problem in gesture recognition, requiring the interpretation of hand movements, facial expressions, and body postures. The complexity of sign formation, signing variability among signers, and the technical hurdles of visual detection and tracking make SLR a challenging task. At the same time, the scarcity of large-scale SLR datasets, which are critical for developing robust, data-intensive deep-learning SLR models, exacerbates these issues. In this article, we introduce a multi-signer video corpus of Greek Sign Language (GSL), the largest GSL database to date, as a valuable resource for SLR research. The corpus comprises an extensive RGB+D video collection that conveys rich lexical content in a multi-modal fashion, encompassing three subsets: (i) isolated signs; (ii) continuous signing; and (iii) continuous alphabet fingerspelling of words. Moreover, we introduce a comprehensive experimental setup that paves the way for more accurate and robust SLR solutions. In particular, in addition to the multi-signer (MS) and signer-independent (SI) settings, we employ a signer-adapted (SA) experimental paradigm, facilitating a comprehensive evaluation of system performance across various scenarios. Further, we provide three baseline SLR systems, for isolated signs, continuous signing, and continuous fingerspelling, which leverage cutting-edge methods in deep learning and sequence modeling to capture the intricate temporal dynamics inherent in sign gestures. The models are evaluated on the three corpus subsets, setting their state-of-the-art recognition benchmark. The SL-ReDu GSL corpus, including its recommended experimental frameworks, is publicly available at https://sl-redu.e-ce.uth.gr/corpus.
Citations: 0
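The difference between the MS and SI protocols mentioned above comes down to how the corpus is partitioned by signer identity. A minimal sketch of an SI split; the tuple layout and signer IDs are illustrative, not the corpus's actual file format:

```python
def signer_independent_split(samples, test_signers):
    """Signer-independent (SI) split: clips from held-out signers never
    appear in training; the multi-signer (MS) setting would instead mix
    all signers across both sides."""
    train, test = [], []
    for path, gloss, signer in samples:
        (test if signer in test_signers else train).append((path, gloss, signer))
    return train, test

# toy corpus: (video_path, gloss_label, signer_id)
corpus = [(f"clip_{i:03d}.mp4", i % 5, f"signer_{i % 7}") for i in range(20)]
train, test = signer_independent_split(corpus, test_signers={"signer_0"})
print(len(train), len(test))  # 17 3
```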
Image compressive sensing reconstruction via nonlocal low-rank residual-based ADMM framework
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-28 · DOI: 10.1016/j.cviu.2024.104204 · Volume 249, Article 104204
Authors: Junhao Zhang, Kim-Hui Yap, Lap-Pui Chau, Ce Zhu
Abstract: Nonlocal low-rank (LR) modeling has proven effective in image compressive sensing (CS) reconstruction: similar patches are clustered into nonlocal image groups using the nonlocal self-similarity (NSS) prior, and an LR penalty is imposed on each group. However, most existing methods approximate the LR matrix directly from the degraded nonlocal image group, which may lead to suboptimal LR matrix approximation and thus unsatisfactory reconstruction results. In this paper, we propose a novel nonlocal low-rank residual (NLRR) approach for image CS reconstruction, which progressively approximates the underlying LR matrix by minimizing the LR residual. To do so, we first use the NSS prior to obtain a good estimate of the original nonlocal image group, and then minimize the LR residual between the degraded nonlocal image group and this estimate to derive a more accurate LR matrix. To ensure the optimization is both feasible and reliable, we employ the alternating direction method of multipliers (ADMM) to solve the NLRR-based image CS reconstruction problem. Experimental results show that the proposed NLRR algorithm achieves superior performance against many popular and state-of-the-art image CS reconstruction methods, in both objective metrics and subjective perceptual quality.
Citations: 0
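One plausible reading of the residual idea: shrink the singular values of the residual between the degraded group and its NSS-based estimate, then add the result back to refine the low-rank matrix. The sketch below shows that reading, with singular value thresholding as the standard LR proximal step inside an ADMM iteration; it is a simplification under stated assumptions, and `tau` and the function names are not the paper's.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm,
    the usual low-rank step inside an ADMM iteration."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def low_rank_residual_update(degraded_group, estimated_group, tau=0.8):
    """One NLRR-flavored refinement: impose low rank on the residual
    between the degraded nonlocal group and its current estimate, then
    add the shrunk residual back onto the estimate."""
    residual = svt(degraded_group - estimated_group, tau)
    return estimated_group + residual

Y = np.random.randn(64, 32)      # degraded nonlocal image group (stacked patches)
X_est = np.random.randn(64, 32)  # NSS-based estimate of the clean group
print(low_rank_residual_update(Y, X_est).shape)  # (64, 32)
```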
A MLP architecture fusing RGB and CASSI for computational spectral imaging
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-25 · DOI: 10.1016/j.cviu.2024.104214 · Volume 249, Article 104214
Authors: Zeyu Cai, Ru Hong, Xun Lin, Jiming Yang, YouLiang Ni, Zhen Liu, Chengqian Jin, Feipeng Da
Abstract: The Coded Aperture Snapshot Spectral Imaging (CASSI) system offers significant advantages for dynamically acquiring hyperspectral images compared to traditional measurement methods, but it faces the following challenges: (1) traditional masks rely on random patterns or analytical design, limiting the performance improvement of CASSI; (2) existing CASSI reconstruction algorithms do not fully utilize RGB information; and (3) high-quality reconstruction algorithms are often slow and limited to offline scene reconstruction. To address these issues, this paper proposes a new MLP architecture, Spectral-Spatial MLP (SSMLP), which replaces the Transformer structure with a network that takes CASSI measurements and RGB as multimodal inputs, maintaining reconstruction quality while significantly improving reconstruction speed. Additionally, we construct a teacher-student network (SSMLP with a teacher, SSMLP-WT) to transfer the knowledge learned by a large model to a smaller network, further enhancing the smaller network's accuracy. Extensive experiments show that SSMLP matches the performance of Transformer-based structures in spectral image reconstruction while improving inference speed by at least 50%. The reconstruction quality of SSMLP-WT is further improved by knowledge transfer without changing the network: the teacher boosts performance by 0.92 dB (44.73 dB vs. 43.81 dB).
Citations: 0
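The teacher-student transfer in SSMLP-WT follows the standard distillation pattern: a supervised term on ground truth plus a mimicry term pulling the student toward the frozen teacher. A minimal sketch of such a loss — the L1/MSE pairing, `alpha`, and the 28-band shapes are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_out, teacher_out, target, alpha=0.5):
    """Supervised reconstruction loss plus a mimicry term toward the
    (detached, i.e. frozen) teacher's output."""
    supervised = F.l1_loss(student_out, target)
    mimic = F.mse_loss(student_out, teacher_out.detach())
    return supervised + alpha * mimic

student_out = torch.randn(2, 28, 64, 64, requires_grad=True)  # 28 spectral bands
teacher_out = torch.randn(2, 28, 64, 64)
target = torch.randn(2, 28, 64, 64)
print(distill_loss(student_out, teacher_out, target).item())
```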
A GCN and Transformer complementary network for skeleton-based action recognition
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-22 · DOI: 10.1016/j.cviu.2024.104213 · Volume 249, Article 104213
Authors: Xuezhi Xiang, Xiaoheng Li, Xuzhao Liu, Yulong Qiao, Abdulmotaleb El Saddik
Abstract: Graph Convolutional Networks (GCNs) have been widely used in skeleton-based action recognition. Although there has been significant progress, an inherent limitation remains: the restricted receptive field of the GCN hinders its ability to extract global dependencies effectively, and joints that are structurally separated can also be strongly correlated. Previous works rarely explore both local and global correlations of joints, and thus insufficiently model the complex dynamics of skeleton sequences. To address this issue, we propose a GCN and Transformer complementary network (GTC-Net) that allows parallel communication between the GCN and Transformer domains. Specifically, we introduce a graph convolution and self-attention combined module (GAM), which effectively leverages the complementarity of graph convolution and self-attention to perceive local and global dependencies of the joints of the human body. Furthermore, to address the problems of long-term sequence ordering and position detection, we design a position-aware module (PAM) that explicitly captures the ordering information and unique identity information for the body joints of a skeleton sequence. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that our method achieves competitive results on both datasets.
Citations: 0
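The GAM's parallel-branch idea — graph convolution for local skeletal structure, self-attention for global joint dependencies, fused additively — can be sketched in a few lines. This is a schematic block, not GTC-Net's actual module; the learnable adjacency, single-linear GCN, and residual fusion are simplifying assumptions:

```python
import torch
import torch.nn as nn

class GCNAttentionBlock(nn.Module):
    """Parallel graph-convolution and self-attention branches over the
    joint dimension, summed so local (skeletal) and global dependencies
    complement each other."""
    def __init__(self, dim, joints, heads=4):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(joints))   # learnable joint graph
        self.gcn = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                            # x: (B, V, C), joints as tokens
        local = self.gcn(torch.einsum('uv,bvc->buc', self.adj, x))
        glob, _ = self.attn(x, x, x)
        return self.norm(x + local + glob)

x = torch.randn(8, 25, 64)                 # batch of 8, 25 joints, 64-d features
print(GCNAttentionBlock(64, 25)(x).shape)  # torch.Size([8, 25, 64])
```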
Reverse Stable Diffusion: What prompt was used to generate this image?
IF 4.3 · CAS Tier 3 (Computer Science)
Computer Vision and Image Understanding · Pub Date: 2024-10-19 · DOI: 10.1016/j.cviu.2024.104210 · Volume 249, Article 104210
Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
Abstract: Text-to-image diffusion models have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and in engineering prompts to obtain desired images. To this end, we study the task of predicting the prompt embedding of an image generated by a generative diffusion model. We consider a series of white-box and black-box models (with and without access to the weights of the diffusion network) for the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e., pairs that are better aligned). We conduct experiments on the DiffusionDB dataset, predicting text prompts from images generated by Stable Diffusion. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts when the model is directly reused for text-to-image generation. Our code is publicly available for download at https://github.com/CroitoruAlin/Reverse-Stable-Diffusion.
Citations: 0
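The joint objective described above pairs a regression term on the prompt embedding with a multi-label term over a word vocabulary. A minimal sketch of one such combined loss — cosine distance plus binary cross-entropy; the weighting `lam` and tensor sizes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def joint_prompt_loss(pred_emb, true_emb, vocab_logits, vocab_targets, lam=0.1):
    """Regress the prompt embedding (cosine distance) and classify which
    vocabulary words appear in the prompt (multi-label BCE)."""
    regression = 1.0 - F.cosine_similarity(pred_emb, true_emb, dim=-1).mean()
    classification = F.binary_cross_entropy_with_logits(vocab_logits, vocab_targets)
    return regression + lam * classification

B, D, V = 4, 512, 1000
pred = torch.randn(B, D, requires_grad=True)     # predicted prompt embeddings
true = torch.randn(B, D)                         # target embeddings
logits = torch.randn(B, V, requires_grad=True)   # per-word presence logits
targets = (torch.rand(B, V) > 0.99).float()      # sparse word-presence labels
print(joint_prompt_loss(pred, true, logits, targets).item())
```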