{"title":"Unsupervised co-generation of foreground–background segmentation from Text-to-Image synthesis","authors":"Yeruru Asrar Ahmed, Anurag Mittal","doi":"10.1016/j.cviu.2024.104223","DOIUrl":"10.1016/j.cviu.2024.104223","url":null,"abstract":"<div><div>Text-to-Image (T2I) synthesis is a challenging task requiring modelling both textual and image domains and their relationship. The substantial improvement in image quality achieved by recent works has paved the way for numerous applications such as language-aided image editing, computer-aided design, text-based image retrieval, and training data augmentation. In this work, we ask a simple question: Along with realistic images, can we obtain any useful by-product (<em>e.g.</em> foreground/background or multi-class segmentation masks, detection labels) in an unsupervised way that will also benefit other computer vision tasks and applications?. In an attempt to answer this question, we explore generating realistic images and their corresponding foreground/background segmentation masks from the given text. To achieve this, we experiment the concept of co-segmentation along with GAN. Specifically, a novel GAN architecture called Co-Segmentation Inspired GAN (COS-GAN) is proposed that generates two or more images simultaneously from different noise vectors and utilises a spatial co-attention mechanism between the image features to produce realistic segmentation masks for each of the generated images. The advantages of such an architecture are two-fold: (1) The generated segmentation masks can be used to focus on foreground and background exclusively to improve the quality of generated images, and (2) the segmentation masks can be used as a training target for other tasks, such as object localisation and segmentation. Extensive experiments conducted on CUB, Oxford-102, and COCO datasets show that COS-GAN is able to improve visual quality and generate reliable foreground/background masks for the generated images.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104223"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Bilevel progressive homography estimation via correlative region-focused transformer
Authors: Qi Jia, Xiaomei Feng, Wei Zhang, Yu Liu, Nan Pu, Nicu Sebe
DOI: 10.1016/j.cviu.2024.104209
Journal: Computer Vision and Image Understanding, vol. 250, Article 104209, published 2024-11-05
Abstract: We propose a novel correlative region-focused transformer for accurate homography estimation built on a bilevel progressive architecture. Existing methods typically use the entire image features to establish correlations between a pair of input images, but irrelevant regions often introduce mismatches and outliers. In contrast, our network mitigates the negative impact of irrelevant regions through a bilevel progressive homography estimation architecture. Specifically, in the outer iteration we progressively estimate the homography matrix at different feature scales; in the inner iteration we dynamically extract correlative regions and progressively focus on their corresponding features from both inputs. Moreover, we develop a quadtree attention mechanism based on the transformer to explicitly capture the correspondence between the input images, localizing and cropping the correlative regions for the next iteration. This progressive training strategy enhances feature consistency and enables precise alignment at comparable inference rates. Extensive qualitative and quantitative comparisons show that the proposed method achieves competitive alignment results while reducing the mean average corner error (MACE) on the MS-COCO dataset compared to previous methods, without additional parameter cost.
Title: Leaf cultivar identification via prototype-enhanced learning
Authors: Yiyi Zhang, Zhiwen Ying, Ying Zheng, Cuiling Wu, Nannan Li, Fangfang Wang, Jun Wang, Xianzhong Feng, Xiaogang Xu
DOI: 10.1016/j.cviu.2024.104221
Journal: Computer Vision and Image Understanding, vol. 250, Article 104221, published 2024-11-05
Abstract: Leaf cultivar identification, a typical ultra-fine-grained visual classification (UFGVC) task, is highly challenging due to the strong similarity among different varieties. In practice, an instance may be related to multiple varieties to varying degrees, especially in UFGVC datasets. However, deep learning methods trained on one-hot labels fail to reflect patterns shared across categories and thus perform poorly on this task. By analogy with natural language processing (NLP), label embeddings that capture the correlation between labels can select the most informative words and neglect irrelevant ones when predicting different labels. Based on this intuition, we propose Prototype-enhanced Learning (PEL), which is predicated on the assumption that label embeddings encoding inter-class relationships force the image classification model to focus on discriminative patterns. In addition, a new prototype update module learns inter-class relations by capturing label semantic overlap and iteratively updates prototypes to generate continuously enhanced soft targets. Prototype-enhanced soft labels not only preserve the original one-hot label information but also introduce rich inter-category semantic associations, thus providing more effective supervision for deep model training. Extensive experiments on 7 public datasets show that our method significantly improves performance on ultra-fine-grained visual classification. The code is available at https://github.com/YIYIZH/PEL.
Title: Enhanced local multi-windows attention network for lightweight image super-resolution
Authors: Yanheng Lv, Lulu Pan, Ke Xu, Guo Li, Wenbo Zhang, Lingxiao Li, Le Lei
DOI: 10.1016/j.cviu.2024.104217
Journal: Computer Vision and Image Understanding, vol. 250, Article 104217, published 2024-11-05
Abstract: Because the global self-attention mechanism captures long-distance dependencies well, Transformer-based methods have achieved remarkable performance on many vision tasks, including single-image super-resolution (SISR). However, images exhibit strong local self-similarities; applying global self-attention everywhere therefore wastes computation on weakly correlated regions, and for high-resolution, large images it leads to a large number of redundant calculations. To solve this problem, we propose the Enhanced Local Multi-windows Attention Network (ELMA), which contains two main designs. First, unlike traditional self-attention based on square window partitions, we propose a Multi-windows Self-Attention (M-WSA) mechanism that uses a new window partitioning scheme to capture different types of local long-distance dependencies. Compared with the self-attention mechanisms commonly used in other SR networks, M-WSA reduces computational complexity and achieves superior performance, as shown by our analysis and experiments. Second, we propose a Spatial Gated Network (SGN) as the feed-forward network. SGN effectively alleviates the intermediate channel redundancy of traditional MLPs, improving the parameter utilization and computational efficiency of the network, and it injects spatial information that a traditional MLP cannot capture, enhancing performance and generalization ability. Extensive experiments show that ELMA achieves competitive performance compared to state-of-the-art methods while requiring fewer parameters and lower computational cost.
Title: Simultaneous image denoising and completion through convolutional sparse representation and nonlocal self-similarity
Authors: Weimin Yuan, Yuanyuan Wang, Ruirui Fan, Yuxuan Zhang, Guangmei Wei, Cai Meng, Xiangzhi Bai
DOI: 10.1016/j.cviu.2024.104216
Journal: Computer Vision and Image Understanding, vol. 249, Article 104216, published 2024-11-04
Abstract: Low rank matrix approximation (LRMA) has been widely studied for its ability to recover the original image from a degraded observation; depending on the type of degradation, the recovery problem takes the form of image denoising or image completion. Existing methods are usually designed for a single task. In this paper, focusing on simultaneous image denoising and completion, we propose a weighted low rank sparse representation model and a corresponding efficient algorithm based on LRMA. The proposed method integrates convolutional analysis sparse representation (ASR) and nonlocal statistical modeling to maintain the local smoothness and nonlocal self-similarity (NLSM) of natural images. Because simultaneous denoising and completion is a complex inverse problem, we solve it efficiently with the alternating direction method of multipliers (ADMM). We conduct experiments on image completion from partial random samples and on mask removal at different noise levels. Extensive experiments on four datasets, i.e., Set12, Kodak, McMaster, and CBSD68, show that the proposed method prevents noise from propagating while completing images and achieves better quantitative results and visual quality than 17 competing methods. It obtains gains of 1.9%, 1.8%, 4.2%, and 3.7% in average PSNR and 4.2%, 2.9%, 6.7%, and 6.6% in average SSIM over the second-best method on the four datasets, respectively. We also demonstrate that our method handles challenging scenarios well. Source code is available at https://github.com/weimin581/demo_CSRNS.
Title: Seam estimation based on dense matching for parallax-tolerant image stitching
Authors: Zhihao Zhang, Jie He, Mouquan Shen, Xianqiang Yang
DOI: 10.1016/j.cviu.2024.104219
Journal: Computer Vision and Image Understanding, vol. 250, Article 104219, published 2024-11-04
Abstract: Image stitching with large parallax poses a significant challenge in computer vision. Existing seam-based approaches attempt to suppress parallax artifacts by stitching images along seams, yet object mismatches, disappearances, and duplications still arise, primarily due to inaccurate dense-pixel alignment or inappropriate seam estimation. In this paper, we propose a robust seam-based parallax-tolerant image stitching method that leverages dense flow estimation from state-of-the-art approaches. First, we develop a seam estimation method that does not require pre-estimating an image warping model: it estimates the seam directly by measuring the local smoothness of the optical flow field and adds a penalty term against duplications. We then design an iterative algorithm that uses the location of the estimated seam to solve a spatially smooth warping model and eliminate outlier correspondences, thereby addressing the intertwined problems of warping-model and seam estimation together. Experiments on real-world images show that the proposed method achieves superior local alignment accuracy near the stitching seam and outperforms other state-of-the-art techniques in visual stitching quality. Code is available at https://github.com/zhihao0512/dense-matching-image-stitching.
Title: Monocular depth estimation with boundary attention mechanism and Shifted Window Adaptive Bins
Authors: Hengjia Hu, Mengnan Liang, Congcong Wang, Meng Zhao, Fan Shi, Chao Zhang, Yilin Han
DOI: 10.1016/j.cviu.2024.104220
Journal: Computer Vision and Image Understanding, vol. 249, Article 104220, published 2024-11-04
Abstract: Monocular depth estimation is a classic research topic in computer vision, and the development of Convolutional Neural Networks (CNNs) has brought significant breakthroughs to the field in recent years. Two challenges remain, however: (1) networks struggle to fuse edge features effectively in the feature fusion stage, which leads to lost structure or distorted object boundaries in the scene; and (2) classification-based approaches typically rely on Transformers for global modeling, which introduces substantial computational overhead. In this paper, we propose two modules to address these issues. The first is the Boundary Attention Module (BAM), which uses an attention mechanism to strengthen the network's perception of object boundaries during feature fusion. The second, intended to mitigate the cost of predicting adaptive bins, is a Shifted Window Adaptive Bins (SWAB) module that reduces the amount of computation in global modeling. The proposed method is evaluated on three public datasets, NYU Depth V2, KITTI, and SUN RGB-D, and achieves state-of-the-art (SOTA) performance.
Title: Multivariate prototype representation for domain-generalized incremental learning
Authors: Can Peng, Piotr Koniusz, Kaiyu Guo, Brian C. Lovell, Peyman Moghadam
DOI: 10.1016/j.cviu.2024.104215
Journal: Computer Vision and Image Understanding, vol. 249, Article 104215, published 2024-10-30
Abstract: Deep learning models often suffer from catastrophic forgetting when fine-tuned with samples of new classes, and the problem becomes even harder when there is a domain shift between training and testing data. In this paper, we address the critical yet less explored Domain-Generalized Class-Incremental Learning (DGCIL) task. We propose a DGCIL approach designed to memorize old classes, adapt to new classes, and reliably classify objects from unseen domains. Specifically, our loss formulation maintains classification boundaries while suppressing domain-specific information for each class. Without storing old exemplars, we employ knowledge distillation and estimate the drift of old-class prototypes as incremental training progresses. Our prototype representations are multivariate Normal distributions whose means and covariances are continually adapted to the evolving model features, providing effective representations for old classes. We then sample pseudo-features for these old classes from the adapted Normal distributions using the Cholesky decomposition. Unlike previous pseudo-feature sampling strategies that rely solely on average mean prototypes, our method captures richer semantic variations. Experiments on several benchmarks demonstrate the superior performance of our method compared to the state of the art.
{"title":"Diffusion Models for Counterfactual Explanations","authors":"Guillaume Jeanneret, Loïc Simon, Frédéric Jurie","doi":"10.1016/j.cviu.2024.104207","DOIUrl":"10.1016/j.cviu.2024.104207","url":null,"abstract":"<div><div>Counterfactual explanations have demonstrated promising results as a post-hoc framework to improve the explanatory power of image classifiers. Herein, this paper proposes DiME, a method that allows the generation of counterfactual images using the latest diffusion models. The proposed method uses a guided generative diffusion process to exploit the gradients of the target classifier to generate counterfactual explanations of the input instances. Furthermore, we examine present strategies for assessing spurious correlations and expand the assessment methods by presenting a novel measure, Correlation Difference, which is more efficient at detecting such correlations. The provided work includes a comprehensive ablation study and a thorough experimental validation demonstrating that the proposed algorithm outperforms previous state-of-the-art results on the CelebA, CelebAHQ and BDD100k datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104207"},"PeriodicalIF":4.3,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D scene generation for zero-shot learning using ChatGPT guided language prompts","authors":"Sahar Ahmadi , Ali Cheraghian , Townim Faisal Chowdhury , Morteza Saberi , Shafin Rahman","doi":"10.1016/j.cviu.2024.104211","DOIUrl":"10.1016/j.cviu.2024.104211","url":null,"abstract":"<div><div>Zero-shot learning in the realm of 3D point cloud data remains relatively unexplored compared to its 2D image counterpart. This domain introduces fresh challenges due to the absence of robust pre-trained feature extraction models. To tackle this, we introduce a prompt-guided method for 3D scene generation and supervision, enhancing the network’s ability to comprehend the intricate relationships between seen and unseen objects. Initially, we utilize basic prompts resembling scene annotations generated from one or two point cloud objects. Recognizing the limited diversity of basic prompts, we employ ChatGPT to expand them, enriching the contextual information within the descriptions. Subsequently, leveraging these descriptions, we arrange point cloud objects’ coordinates to fabricate augmented 3D scenes. Lastly, employing contrastive learning, we train our proposed architecture end-to-end, utilizing pairs of 3D scenes and prompt-based captions. We posit that 3D scenes facilitate more efficient object relationships than individual objects, as demonstrated by the effectiveness of language models like BERT in contextual understanding. Our prompt-guided scene generation method amalgamates data augmentation and prompt-based annotation, thereby enhancing 3D ZSL performance. We present ZSL and generalized ZSL results on both synthetic (ModelNet40, ModelNet10, and ShapeNet) and real-scanned (ScanOjbectNN) 3D object datasets. Furthermore, we challenge the model by training with synthetic data and testing with real-scanned data, achieving state-of-the-art performance compared to existing 2D and 3D ZSL methods in the literature. Codes and models are available at: <span><span>https://github.com/saharahmadisohraviyeh/ChatGPT_ZSL_3D</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104211"},"PeriodicalIF":4.3,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142663861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}