Salma González-Sabbagh , Antonio Robles-Kelly , Shang Gao
{"title":"Scene-cGAN: A GAN for underwater restoration and scene depth estimation","authors":"Salma González-Sabbagh , Antonio Robles-Kelly , Shang Gao","doi":"10.1016/j.cviu.2024.104225","DOIUrl":"10.1016/j.cviu.2024.104225","url":null,"abstract":"<div><div>Despite their wide scope of application, the development of underwater models for image restoration and scene depth estimation is not a straightforward task due to the limited size and quality of underwater datasets, as well as variations in water colours resulting from attenuation, absorption and scattering phenomena in the water column. To address these challenges, we present an all-in-one conditional generative adversarial network (cGAN) called Scene-cGAN. Our cGAN is a physics-based multi-domain model designed for image dewatering, restoration and depth estimation. It comprises three generators and one discriminator. To train our Scene-cGAN, we use a multi-term loss function based on uni-directional cycle-consistency and a novel dataset. This dataset is constructed from RGB-D in-air images using spectral data and concentrations of water constituents obtained from real-world water quality surveys. This approach allows us to produce imagery consistent with the radiance and veiling light corresponding to representative water types. Additionally, we compare Scene-cGAN with current state-of-the-art methods using various datasets. 
Results demonstrate its competitiveness in terms of colour restoration and its effectiveness in estimating the depth information for complex underwater scenes.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104225"},"PeriodicalIF":4.3,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jacopo Burger, Giorgio Blandano, Giuseppe Maurizio Facchi, Raffaella Lanzarotti
{"title":"2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data","authors":"Jacopo Burger, Giorgio Blandano, Giuseppe Maurizio Facchi, Raffaella Lanzarotti","doi":"10.1016/j.cviu.2024.104227","DOIUrl":"10.1016/j.cviu.2024.104227","url":null,"abstract":"<div><div>Facial Landmark Detection (FLD) algorithms play a crucial role in numerous computer vision applications, particularly in tasks such as face recognition, head pose estimation, and facial expression analysis. While FLD on images has long been the focus, the emergence of 3D data has led to a surge of interest in FLD on it due to its potential applications in various fields, including medical research. However, automating FLD in this context presents significant challenges, such as selecting suitable network architectures, refining outputs for precise landmark localization and optimizing computational efficiency. In response, this paper presents a novel approach, the 2-Stage Stratified Graph Convolutional Network (<span>2S-SGCN</span>), which addresses these challenges comprehensively. The first stage aims to detect landmark regions using heatmap regression, which leverages both local and long-range dependencies through a stratified approach. In the second stage, 3D landmarks are precisely determined using a new post-processing technique, namely <span>MSE-over-mesh</span>. <span>2S-SGCN</span> ensures both efficiency and suitability for resource-constrained devices. Experimental results on 3D scans from the public Facescape and Headspace datasets, as well as on point clouds derived from FLAME meshes collected in the DAD-3DHeads dataset, demonstrate that the proposed method achieves state-of-the-art performance across various conditions. 
Source code is accessible at <span><span>https://github.com/gfacchi-dev/CVIU-2S-SGCN</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104227"},"PeriodicalIF":4.3,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual stage semantic information based generative adversarial network for image super-resolution","authors":"Shailza Sharma , Abhinav Dhall , Shikhar Johri , Vinay Kumar , Vivek Singh","doi":"10.1016/j.cviu.2024.104226","DOIUrl":"10.1016/j.cviu.2024.104226","url":null,"abstract":"<div><div>Deep learning has revolutionized image super-resolution, yet challenges persist in preserving intricate details and avoiding overly smooth reconstructions. In this work, we introduce a novel architecture, the Residue and Semantic Feature-based Dual Subpixel Generative Adversarial Network (RSF-DSGAN), which emphasizes the critical role of semantic information in addressing these issues. The proposed generator architecture is designed with two sequential stages: the Premier Residual Stage and the Deuxième Residual Stage. These stages are concatenated to form a dual-stage upsampling process, substantially augmenting the model’s capacity for feature learning. A central innovation of our approach is the integration of semantic information directly into the generator. Specifically, feature maps derived from a pre-trained network are fused with the primary feature maps of the first stage, enriching the generator with high-level contextual cues. This semantic infusion enhances the fidelity and sharpness of reconstructed images, particularly in preserving object details and textures. Inter- and intra-residual connections are employed within these stages to maintain high-frequency details and fine textures. Additionally, spectral normalization is introduced in the discriminator to stabilize training. 
Comprehensive evaluations, including visual perception and mean opinion scores, demonstrate that RSF-DSGAN, with its emphasis on semantic information, outperforms current state-of-the-art super-resolution methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104226"},"PeriodicalIF":4.3,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ling Fu , Zijie Wu , Yingying Zhu , Yuliang Liu , Xiang Bai
{"title":"Enhancing scene text detectors with realistic text image synthesis using diffusion models","authors":"Ling Fu , Zijie Wu , Yingying Zhu , Yuliang Liu , Xiang Bai","doi":"10.1016/j.cviu.2024.104224","DOIUrl":"10.1016/j.cviu.2024.104224","url":null,"abstract":"<div><div>Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the <strong>Diff</strong>usion Model based <strong>Text</strong> Generator (<strong>DiffText</strong>), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background’s intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images. 
Code is available at: <span><span>https://github.com/99Franklin/DiffText</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104224"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangxiang Wang , Lixing Fang , Junli Zhao , Zhenkuan Pan , Hui Li , Yi Li
{"title":"UUD-Fusion: An unsupervised universal image fusion approach via generative diffusion model","authors":"Xiangxiang Wang , Lixing Fang , Junli Zhao , Zhenkuan Pan , Hui Li , Yi Li","doi":"10.1016/j.cviu.2024.104218","DOIUrl":"10.1016/j.cviu.2024.104218","url":null,"abstract":"<div><div>Image fusion is a classical problem in the field of image processing whose solutions are usually not unique. The common image fusion methods can only generate a fixed fusion result based on the source image pairs. They tend to be applicable only to a specific task and have high computational costs. Hence, in this paper, a two-stage unsupervised universal image fusion with generative diffusion model is proposed, termed as UUD-Fusion. For the first stage, a strategy based on the initial fusion results is devised to offload the computational effort. For the second stage, two novel sampling algorithms based on generative diffusion model are designed. The fusion sequence generation algorithm (FSGA) searches for a series of solutions in the solution space by iterative sampling. The fusion image enhancement algorithm (FIEA) greatly improves the quality of the fused images. Qualitative and quantitative evaluations of multiple datasets with different modalities demonstrate the great versatility and effectiveness of UUD-Fusion. It is capable of solving different fusion problems, including multi-focus image fusion task, multi-exposure image fusion task, infrared and visible fusion task, and medical image fusion task. The proposed approach is superior to current state-of-the-art methods. 
Our code is publicly available at <span><span>https://github.com/xiangxiang-wang/UUD-Fusion</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104218"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142663862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised co-generation of foreground–background segmentation from Text-to-Image synthesis","authors":"Yeruru Asrar Ahmed, Anurag Mittal","doi":"10.1016/j.cviu.2024.104223","DOIUrl":"10.1016/j.cviu.2024.104223","url":null,"abstract":"<div><div>Text-to-Image (T2I) synthesis is a challenging task that requires modelling both the textual and image domains and their relationship. The substantial improvement in image quality achieved by recent works has paved the way for numerous applications such as language-aided image editing, computer-aided design, text-based image retrieval, and training data augmentation. In this work, we ask a simple question: Along with realistic images, can we obtain any useful by-product (<em>e.g.</em> foreground/background or multi-class segmentation masks, detection labels) in an unsupervised way that will also benefit other computer vision tasks and applications? In an attempt to answer this question, we explore generating realistic images and their corresponding foreground/background segmentation masks from the given text. To achieve this, we experiment with the concept of co-segmentation in combination with a GAN. Specifically, a novel GAN architecture called Co-Segmentation Inspired GAN (COS-GAN) is proposed that generates two or more images simultaneously from different noise vectors and utilises a spatial co-attention mechanism between the image features to produce realistic segmentation masks for each of the generated images. The advantages of such an architecture are two-fold: (1) The generated segmentation masks can be used to focus on foreground and background exclusively to improve the quality of generated images, and (2) the segmentation masks can be used as a training target for other tasks, such as object localisation and segmentation. 
Extensive experiments conducted on CUB, Oxford-102, and COCO datasets show that COS-GAN is able to improve visual quality and generate reliable foreground/background masks for the generated images.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104223"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qi Jia , Xiaomei Feng , Wei Zhang , Yu Liu , Nan Pu , Nicu Sebe
{"title":"Bilevel progressive homography estimation via correlative region-focused transformer","authors":"Qi Jia , Xiaomei Feng , Wei Zhang , Yu Liu , Nan Pu , Nicu Sebe","doi":"10.1016/j.cviu.2024.104209","DOIUrl":"10.1016/j.cviu.2024.104209","url":null,"abstract":"<div><div>We propose a novel correlative region-focused transformer for accurate homography estimation by a bilevel progressive architecture. Existing methods typically consider the entire image features to establish correlations for a pair of input images, but irrelevant regions often introduce mismatches and outliers. In contrast, our network effectively mitigates the negative impact of irrelevant regions through a bilevel progressive homography estimation architecture. Specifically, in the outer iteration, we progressively estimate the homography matrix at different feature scales; in the inner iteration, we dynamically extract correlative regions and progressively focus on their corresponding features from both inputs. Moreover, we develop a quadtree attention mechanism based on the transformer to explicitly capture the correspondence between the input images, localizing and cropping the correlative regions for the next iteration. This progressive training strategy enhances feature consistency and enables precise alignment with comparable inference rates. 
Extensive qualitative and quantitative comparisons show that the proposed method achieves competitive alignment results while reducing the mean average corner error (MACE) on the MS-COCO dataset compared to previous methods, without additional parameter cost.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104209"},"PeriodicalIF":4.3,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yiyi Zhang , Zhiwen Ying , Ying Zheng , Cuiling Wu , Nannan Li , Fangfang Wang , Jun Wang , Xianzhong Feng , Xiaogang Xu
{"title":"Leaf cultivar identification via prototype-enhanced learning","authors":"Yiyi Zhang , Zhiwen Ying , Ying Zheng , Cuiling Wu , Nannan Li , Fangfang Wang , Jun Wang , Xianzhong Feng , Xiaogang Xu","doi":"10.1016/j.cviu.2024.104221","DOIUrl":"10.1016/j.cviu.2024.104221","url":null,"abstract":"<div><div>Leaf cultivar identification, as a typical task of ultra-fine-grained visual classification (UFGVC), is facing a huge challenge due to the high similarity among different varieties. In practice, an instance may be related to multiple varieties to varying degrees, especially in the UFGVC datasets. However, deep learning methods trained on one-hot labels fail to reflect patterns shared across categories and thus perform poorly on this task. As an analogy to natural language processing (NLP), by capturing the co-relation between labels, label embedding can select the most informative words and neglect irrelevant ones when predicting different labels. Based on this intuition, we propose a novel method named Prototype-enhanced Learning (PEL), which is predicated on the assumption that label embedding encoded with the inter-class relationships would force the image classification model to focus on discriminative patterns. In addition, a new prototype update module is put forward to learn inter-class relations by capturing label semantic overlap and iteratively update prototypes to generate continuously enhanced soft targets. Prototype-enhanced soft labels not only contain original one-hot label information, but also introduce rich inter-category semantic association information, thus providing more effective supervision for deep model training. Extensive experimental results on 7 public datasets show that our method can significantly improve the performance on the task of ultra-fine-grained visual classification. 
The code is available at <span><span>https://github.com/YIYIZH/PEL</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104221"},"PeriodicalIF":4.3,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yanheng Lv, Lulu Pan, Ke Xu, Guo Li, Wenbo Zhang, Lingxiao Li, Le Lei
{"title":"Enhanced local multi-windows attention network for lightweight image super-resolution","authors":"Yanheng Lv, Lulu Pan, Ke Xu, Guo Li, Wenbo Zhang, Lingxiao Li, Le Lei","doi":"10.1016/j.cviu.2024.104217","DOIUrl":"10.1016/j.cviu.2024.104217","url":null,"abstract":"<div><div>Since the global self-attention mechanism can capture long-distance dependencies well, Transformer-based methods have achieved remarkable performance in many vision tasks, including single-image super-resolution (SISR). However, images exhibit strong local self-similarities; if the global self-attention mechanism is still used for image processing, it may spend excessive computing resources on weakly correlated parts of the image. Especially for large, high-resolution images, global self-attention leads to a large number of redundant calculations. To solve this problem, we propose the Enhanced Local Multi-windows Attention Network (ELMA), which contains two main designs. First, different from the traditional self-attention based on square window partition, we propose a Multi-windows Self-Attention (M-WSA) which uses a new window partitioning mechanism to obtain different types of local long-distance dependencies. Analysis and experiments show that, compared with the original self-attention mechanisms commonly used in other SR networks, M-WSA reduces computational complexity and achieves superior performance. Secondly, we propose a Spatial Gated Network (SGN) as a feed-forward network, which can effectively overcome the problem of intermediate channel redundancy in traditional MLP, thereby improving the parameter utilization and computational efficiency of the network. Meanwhile, SGN introduces spatial information into the feed-forward network that traditional MLP cannot obtain. It can better understand and use the spatial structure information in the image, which enhances network performance and generalization ability. 
Extensive experiments show that ELMA achieves competitive performance compared to state-of-the-art methods with fewer parameters and lower computational cost.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104217"},"PeriodicalIF":4.3,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weimin Yuan , Yuanyuan Wang , Ruirui Fan , Yuxuan Zhang , Guangmei Wei , Cai Meng , Xiangzhi Bai
{"title":"Simultaneous image denoising and completion through convolutional sparse representation and nonlocal self-similarity","authors":"Weimin Yuan , Yuanyuan Wang , Ruirui Fan , Yuxuan Zhang , Guangmei Wei , Cai Meng , Xiangzhi Bai","doi":"10.1016/j.cviu.2024.104216","DOIUrl":"10.1016/j.cviu.2024.104216","url":null,"abstract":"<div><div>Low rank matrix approximation (LRMA) has been widely studied due to its capability of approximating the original image from a degraded one. Depending on the characteristics of the degraded images, image denoising and image completion have both become research topics. Existing methods are usually designed for a single task. In this paper, focusing on the task of simultaneous image denoising and completion, we propose a weighted low rank sparse representation model and the corresponding efficient algorithm based on LRMA. The proposed method integrates convolutional analysis sparse representation (ASR) and nonlocal statistical modeling to maintain local smoothness and nonlocal self-similarity (NLSM) of natural images. More importantly, we explore the alternating direction method of multipliers (ADMM) to solve the above inverse problem efficiently due to the complexity of simultaneous image denoising and completion. We conduct experiments on image completion for partial random samples and mask removal with different noise levels. Extensive experiments on four datasets, i.e., Set12, Kodak, McMaster, and CBSD68, show that the proposed method prevents the transmission of noise while completing images and achieves better quantitative results and human visual quality compared to 17 methods. The proposed method achieves (1.9%, 1.8%, 4.2%, and 3.7%) gains in average PSNR and (4.2%, 2.9%, 6.7%, and 6.6%) gains in average SSIM over the sub-optimal method across the four datasets, respectively. We also demonstrate that our method can handle challenging scenarios well. 
Source code is available at <span><span>https://github.com/weimin581/demo_CSRNS</span></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104216"},"PeriodicalIF":4.3,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142663859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}