Computer Vision and Image Understanding: Latest Articles

Font transformer for few-shot font generation
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-24, DOI: 10.1016/j.cviu.2024.104043
Xu Chen, Lei Wu, Yongliang Su, Lei Meng, Xiangxu Meng
{"title":"Font transformer for few-shot font generation","authors":"Xu Chen,&nbsp;Lei Wu,&nbsp;Yongliang Su,&nbsp;Lei Meng,&nbsp;Xiangxu Meng","doi":"10.1016/j.cviu.2024.104043","DOIUrl":"10.1016/j.cviu.2024.104043","url":null,"abstract":"<div><p>Automatic font generation is of great benefit to improving the efficiency of font designers. Few-shot font generation aims to generate new fonts from a few reference samples, and has recently attracted a lot of attention from researchers. This is valuable but challenging, especially for ideograms with high diversity and complex structures. Existing models based on convolutional neural networks (CNNs) struggle to generate glyphs with accurate font style and stroke details in the few-shot setting. This paper proposes the TransFont, exploiting the long-range dependency modeling ability of the Vision Transformer (ViT) for few-shot font generation. For the first time, we empirically show that the ViT is better at glyph image generation than CNNs. Furthermore, based on the observation of the high redundancy in the glyph feature map, we introduce the glyph self-attention module for mitigating the quadratic computational and memory complexity of the pixel-level glyph image generation, along with several new techniques, i.e., multi-head multiple sampling, yz axis convolution, and approximate relative position bias. Extensive experiments on two Chinese font libraries show the superiority of our method over existing CNN-based font generation models, the proposed TransFont generates glyph images with more accurate font style and stroke details.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141141764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Improving semantic video retrieval models by training with a relevance-aware online mining strategy
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-20, DOI: 10.1016/j.cviu.2024.104035
Alex Falcon, Giuseppe Serra, Oswald Lanz
{"title":"Improving semantic video retrieval models by training with a relevance-aware online mining strategy","authors":"Alex Falcon ,&nbsp;Giuseppe Serra ,&nbsp;Oswald Lanz","doi":"10.1016/j.cviu.2024.104035","DOIUrl":"10.1016/j.cviu.2024.104035","url":null,"abstract":"<div><p>To retrieve a video via a multimedia search engine, a textual query is usually created by the user and then used to perform the search. Recent state-of-the-art cross-modal retrieval methods learn a joint text–video embedding space by using contrastive loss functions, which maximize the similarity of <em>positive</em> pairs while decreasing that of the <em>negative</em> pairs. Although the choice of these pairs is fundamental for the construction of the joint embedding space, the selection procedure is usually driven by the relationships found within the dataset: a positive pair is commonly formed by a video and its own caption, whereas unrelated video-caption pairs represent the negative ones. We hypothesize that this choice results in a retrieval system with limited semantics understanding, as the standard training procedure requires the system to discriminate between groundtruth and negative even though there is no difference in their semantics. Therefore, differently from the previous approaches, in this paper we propose a novel strategy for the selection of both positive and negative pairs which takes into account both the annotations and the semantic contents of the captions. By doing so, the selected negatives do not share semantic concepts with the positive pair anymore, and it is also possible to discover new positives within the dataset. Based on our hypothesis, we provide a novel design of two popular contrastive loss functions, and explore their effectiveness on four heterogeneous state-of-the-art approaches. The extensive experimental analysis conducted on four datasets, EPIC-Kitchens-100, MSR-VTT, MSVD, and Charades, validates the effectiveness of the proposed strategy, observing, e.g., more than +20% nDCG on EPIC-Kitchens-100. Furthermore, these results are corroborated with qualitative evidence both supporting our hypothesis and explaining why the proposed strategy effectively overcomes it.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001164/pdfft?md5=e9059d9fba16e21f7573ef224c40196d&pid=1-s2.0-S1077314224001164-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141143632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Text to image synthesis with multi-granularity feature aware enhancement Generative Adversarial Networks
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-20, DOI: 10.1016/j.cviu.2024.104042
Pei Dong, Lei Wu, Ruichen Li, Xiangxu Meng, Lei Meng
{"title":"Text to image synthesis with multi-granularity feature aware enhancement Generative Adversarial Networks","authors":"Pei Dong,&nbsp;Lei Wu,&nbsp;Ruichen Li,&nbsp;Xiangxu Meng,&nbsp;Lei Meng","doi":"10.1016/j.cviu.2024.104042","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104042","url":null,"abstract":"<div><p>Synthesizing complex images from text presents challenging. Compared to autoregressive and diffusion model-based methods, Generative Adversarial Network-based methods have significant advantages in terms of computational cost and generation efficiency yet remain two limitations: first, these methods often refine all features output from the previous stage indiscriminately, without considering these features are initialized gradually during the generation process; second, the sparse semantic constraints provided by the text description are typically ineffective for refining fine-grained features. These issues complicate the balance between generation quality, computational cost and inference speed. To address these issues, we propose a Multi-granularity Feature Aware Enhancement GAN (MFAE-GAN), which allows the refinement process to match the order of different granularity features being initialized. Specifically, MFAE-GAN (1) samples category-related coarse-grained features and instance-level detail-related fine-grained features at different generation stages based on different attention mechanisms in Coarse-grained Feature Enhancement (CFE) and Fine-grained Feature Enhancement (FFE) to guide the generation process spatially, (2) provides denser semantic constraints than textual semantic information through Multi-granularity Features Adaptive Batch Normalization (MFA-BN) in the process of refining fine-grained features, and (3) adopts a Global Semantics Preservation (GSP) to avoid the loss of global semantics when sampling features continuously. Extensive experimental results demonstrate that our MFAE-GAN is competitive in terms of both image generation quality and efficiency.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141097597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Complete contextual information extraction for self-supervised monocular depth estimation
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-15, DOI: 10.1016/j.cviu.2024.104032
Dazheng Zhou, Mingliang Zhang, Xianjie Gao, Youmei Zhang, Bin Li
{"title":"Complete contextual information extraction for self-supervised monocular depth estimation","authors":"Dazheng Zhou ,&nbsp;Mingliang Zhang ,&nbsp;Xianjie Gao ,&nbsp;Youmei Zhang ,&nbsp;Bin Li","doi":"10.1016/j.cviu.2024.104032","DOIUrl":"10.1016/j.cviu.2024.104032","url":null,"abstract":"<div><p>Self-supervised learning methods are increasingly important for monocular depth estimation since they do not require ground-truth data during training. Although existing methods have achieved great success for better monocular depth estimation based on Convolutional Neural Networks (CNNs), the limited receptive field of CNNs usually is insufficient to effectively model the global information, e.g., relationship between foreground and background or relationship among objects, which are crucial for accurately capturing scene structure. Recently, some studies based on Transformers have attracted significant interest in computer vision. However, duo to the lack of spatial locality bias, they may fail to model the local information, e.g., fine-grained details with an image. To tackle these issues, we propose a novel self-supervised learning framework by incorporating the advantages of both the CNNs and Transformers so as to model the complete contextual information for high-quality monocular depth estimation. Specifically, the proposed method mainly includes two branches, where the Transformer branch is considered to capture the global information while the Convolution branch is exploited to preserve the local information. We also design a rectangle convolution module with pyramid structure to perceive the semi-global information, e.g. thin objects, along the horizontal and vertical directions within an image. Moreover, we propose a shape refinement module by learning the affinity matrix between pixel and its neighborhood to obtain accurate geometrical structure of scenes. Extensive experiments evaluated on KITTI, Cityscapes and Make3D dataset demonstrate that the proposed method achieves the competitive result compared with the state-of-the-art self-supervised monocular depth estimation methods and shows good cross-dataset generalization ability.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141023280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Digital image defogging using joint Retinex theory and independent component analysis
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-14, DOI: 10.1016/j.cviu.2024.104033
Hossein Noori, Mohammad Hossein Gholizadeh, Hossein Khodabakhshi Rafsanjani
{"title":"Digital image defogging using joint Retinex theory and independent component analysis","authors":"Hossein Noori ,&nbsp;Mohammad Hossein Gholizadeh ,&nbsp;Hossein Khodabakhshi Rafsanjani","doi":"10.1016/j.cviu.2024.104033","DOIUrl":"10.1016/j.cviu.2024.104033","url":null,"abstract":"<div><p>The images captured under adverse weather conditions suffer from poor visibility and contrast problems. Such images are not suitable for computer vision analysis and similar applications. Therefore, image defogging/dehazing is one of the most intriguing topics. In this paper, a new, fast, and robust defogging/de-hazing algorithm is proposed by combining the Retinex theory with independent component analysis, which performs better than existing algorithms. Initially, the foggy image is decomposed into two components: reflectance and luminance. The former is computed using the Retinex theory, while the latter is obtained by decomposing the foggy image into parallel and perpendicular components of air-light. Finally, the defogged image is obtained by applying Koschmieder’s law. Simulation results demonstrate the absence of halo effects and the presence of high-resolution images. The simulation results also confirm the effectiveness of the proposed method when compared to other conventional techniques in terms of NIQE, FADE, SSIM, PSNR, AG, CIEDE2000, <span><math><mover><mrow><mi>r</mi></mrow><mrow><mo>̄</mo></mrow></mover></math></span>, and implementation time. All foggy and defogged results are available in high quality at the following link: <span>https://drive.google.com/file/d/1OStXrfzdnF43gr6PAnBd8BHeThOfj33z/view?usp=drive_link</span><svg><path></path></svg>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141035541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Take a prior from other tasks for severe blur removal
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-10, DOI: 10.1016/j.cviu.2024.104027
Pei Wang, Yu Zhu, Danna Xue, Qingsen Yan, Jinqiu Sun, Sung-eui Yoon, Yanning Zhang
{"title":"Take a prior from other tasks for severe blur removal","authors":"Pei Wang ,&nbsp;Yu Zhu ,&nbsp;Danna Xue ,&nbsp;Qingsen Yan ,&nbsp;Jinqiu Sun ,&nbsp;Sung-eui Yoon ,&nbsp;Yanning Zhang","doi":"10.1016/j.cviu.2024.104027","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104027","url":null,"abstract":"<div><p>Recovering clear structures from severely blurry inputs is a huge challenge due to the detail loss and ambiguous semantics. Although segmentation maps can help deblur facial images, their effectiveness is limited in complex natural scenes because they ignore the detailed structures necessary for deblurring. Furthermore, direct segmentation of blurry images may introduce error propagation. To alleviate the semantic confusion and avoid error propagation, we propose utilizing high-level vision tasks, such as classification, to learn a comprehensive prior for severe blur removal. We propose a feature learning strategy based on knowledge distillation, which aims to learn the priors with global contexts and sharp local structures. To integrate the priors effectively, we propose a semantic prior embedding layer with multi-level aggregation and semantic attention. We validate our method on natural image deblurring benchmarks by introducing the priors to various models, including UNet and mainstream deblurring baselines, to demonstrate its effectiveness and generalization ability. The results show that our approach outperforms existing methods on severe blur removal with our plug-and-play semantic priors.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141077672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Other tokens matter: Exploring global and local features of Vision Transformers for Object Re-Identification
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-03, DOI: 10.1016/j.cviu.2024.104030
Yingquan Wang, Pingping Zhang, Dong Wang, Huchuan Lu
{"title":"Other tokens matter: Exploring global and local features of Vision Transformers for Object Re-Identification","authors":"Yingquan Wang ,&nbsp;Pingping Zhang ,&nbsp;Dong Wang ,&nbsp;Huchuan Lu","doi":"10.1016/j.cviu.2024.104030","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104030","url":null,"abstract":"<div><p>Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global–local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of global and local features of ViT and then further propose a novel Global–Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from last few layers of ViT already have a strong representational ability, and the global and local information can mutually enhance each other. Based on this fact, we propose a Global Aggregation Encoder (GAE) to utilize the class tokens of the last few Transformer layers and learn comprehensive global features effectively. Meanwhile, we propose the Local Multi-layer Fusion (LMF) which leverages both the global cues from GAE and multi-layer patch tokens to explore the discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks. The code is available at <span>https://github.com/AWangYQ/GLTrans</span><svg><path></path></svg>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140901235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

An unsupervised multi-focus image fusion method via dual-channel convolutional network and discriminator
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-05-01, DOI: 10.1016/j.cviu.2024.104029
Lixing Fang, Xiangxiang Wang, Junli Zhao, Zhenkuan Pan, Hui Li, Yi Li
{"title":"An unsupervised multi-focus image fusion method via dual-channel convolutional network and discriminator","authors":"Lixing Fang ,&nbsp;Xiangxiang Wang ,&nbsp;Junli Zhao ,&nbsp;Zhenkuan Pan ,&nbsp;Hui Li ,&nbsp;Yi Li","doi":"10.1016/j.cviu.2024.104029","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104029","url":null,"abstract":"<div><p>The challenge in multi-focus image fusion tasks lies in accurately preserving the complementary information from the source images in the fused image. However, existing datasets often lack ground truth images, making it difficult for some full-reference loss functions (such as SSIM) to effectively participate in model training, thereby further affecting the performance of retaining source image details. To address this issue, this paper proposes an unsupervised dual-channel dense convolutional method, DCD, for multi-focus image fusion. DCD designs Patch processing blocks specifically for the fusion task, which segment the source image pairs into equally sized patches and evaluate their information to obtain a reconstructed image and a set of adaptive weight coefficients. The reconstructed image is used as the reference image, enabling unsupervised methods to utilize full-reference loss functions in training and overcoming the challenge of lacking labeled data in the training set. Furthermore, considering that the human visual system (HVS) is more sensitive to brightness than color, DCD trains the dual-channel network using both RGB images and their luminance components. This allows the network to focus more on the brightness information while preserving the color and gradient details of the source images, resulting in fused images that are more compatible with the HVS. The adaptive weight coefficients obtained through the Patch processing blocks are also used to determine the degree of preservation of the brightness information in the source images. Finally, comparative experiments on different datasets also demonstrate the superior performance of DCD in terms of fused image quality compared to other methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140880051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Lightweight all-focused light field rendering
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-04-27, DOI: 10.1016/j.cviu.2024.104031
Tomáš Chlubna, Tomáš Milet, Pavel Zemčík
{"title":"Lightweight all-focused light field rendering","authors":"Tomáš Chlubna ,&nbsp;Tomáš Milet ,&nbsp;Pavel Zemčík","doi":"10.1016/j.cviu.2024.104031","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104031","url":null,"abstract":"<div><p>This paper proposes a novel real-time method for high-quality view interpolation from light field. The proposal is a lightweight method, which can be used with consumer GPU, reaching same or better quality than existing methods, in a shorter time, with significantly smaller memory requirements. Light field belongs to image-based rendering methods that can produce realistic images without computationally demanding algorithms. The novel view is synthesized from multiple input images of the same scene, captured at different camera positions. Standard rendering techniques, such as rasterization or ray-tracing, are limited in terms of quality, memory footprint, and speed. Light field rendering methods often produce unwanted artifacts resembling ghosting or blur in certain parts of the scene due to unknown geometry of the scene. The proposed method estimates the geometry for each pixel as an optimal focusing distance to mitigate the artifacts. The focusing distance determines which pixels from the input images are mixed to produce the final view. State-of-the-art methods use a constant-step pixel matching scan that iterates over a range of focusing distances. The scan searches for a distance with the smallest color dispersion of the contributing pixels, assuming that they belong to the same spot in the scene. The paper proposes an optimal scanning strategy of the focusing range, an improved color dispersion metric, and other minor improvements, such as sampling block size adjustment, out-of-bounds sampling, and filtering. Experimental results show that the proposal uses less resources, achieves better visual quality, and is significantly faster than existing light field rendering methods. The proposal is <span><math><mrow><mn>8</mn><mo>×</mo></mrow></math></span> faster than the methods in the same category. The proposal uses only four closest views from the light field data and reduces the necessary data transfer. Existing methods often require the full light field grid, which is typically 8 × 8 images large. Additionally, a new 4K light field dataset, containing scenes of various types, was created and published. An optimal novel method for light field acquisition is also proposed and used to create the dataset.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140825344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Conditioning diffusion models via attributes and semantic masks for face generation
IF 4.5, CAS Tier 3, Computer Science
Computer Vision and Image Understanding, Pub Date: 2024-04-27, DOI: 10.1016/j.cviu.2024.104026
Giuseppe Lisanti, Nico Giambi
{"title":"Conditioning diffusion models via attributes and semantic masks for face generation","authors":"Giuseppe Lisanti,&nbsp;Nico Giambi","doi":"10.1016/j.cviu.2024.104026","DOIUrl":"https://doi.org/10.1016/j.cviu.2024.104026","url":null,"abstract":"<div><p>Deep generative models have shown impressive results in generating realistic images of faces. GANs managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. This paper introduces a novel strategy for enhancing diffusion models through multi-conditioning, harnessing cross-attention mechanisms to utilize multiple feature sets, ultimately enabling the generation of high-quality and controllable images. The proposed method extends previous approaches by introducing conditioning on both attributes and semantic masks, ensuring finer control over the generated face images. In order to improve the training time and the generation quality, the impact of applying perceptual-focused loss weighting into the latent space instead of the pixel space is also investigated. The proposed solution has been evaluated on the CelebA-HQ dataset, and it can generate realistic and diverse samples while allowing for fine-grained control over multiple attributes and semantic regions. Experiments on the DeepFashion dataset have also been performed in order to analyze the capability of the proposed model to generalize to different domains. In addition, an ablation study has been conducted to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.5,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001073/pdfft?md5=72f1d087600c3806c03661cd66fb5a1d&pid=1-s2.0-S1077314224001073-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140901271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0