{"title":"Multi-dimensional attention-aided transposed ConvBiLSTM network for hyperspectral image super-resolution","authors":"","doi":"10.1016/j.cviu.2024.104096","DOIUrl":"10.1016/j.cviu.2024.104096","url":null,"abstract":"<div><p>Hyperspectral (HS) image always suffers from the deficiency of low spatial resolution, compared with conventional optical image types, which has limited its further applications in remote sensing areas. Therefore, HS image super-resolution (SR) techniques are broadly employed in order to observe finer spatial structures while preserving the spectra of ground covers. In this paper, a novel multi-dimensional attention-aided transposed convolutional long-short term memory (LSTM) network is proposed for single HS image super-resolution task. The proposed network employs the convolutional bi-directional LSTM for the purpose of local and non-local spatial–spectral feature explorations, and transposed convolution for the purpose of image amplification and reconstruction. Moreover, a multi-dimensional attention module is proposed, aiming to capture the salient features on spectral, channel, and spatial dimensions, simultaneously, to further improve the learning abilities of network. Experiments on four commonly-used HS images demonstrate the effectiveness of this approach, compared with several state-of-the-art deep learning-based SR methods.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141962989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cascaded UNet for progressive noise residual prediction for structure-preserving video denoising","authors":"","doi":"10.1016/j.cviu.2024.104103","DOIUrl":"10.1016/j.cviu.2024.104103","url":null,"abstract":"<div><p>The prominence of high-quality video services has become so substantial that by 2030, it is estimated that approximately 80% of internet traffic will consist of videos. On the contrary, video denoising remains a relatively unexplored and intricate field, presenting more substantial challenges compared to image denoising. Many published deep learning video denoising algorithms typically rely on simple, efficient single encoder–decoder networks, but they have inherent limitations in preserving intricate image details and effectively managing noise information propagation for noise residue modelling. In response to these challenges, the proposed work introduces an innovative approach; in terms of utilization of cascaded UNets for progressive noise residual prediction in video denoising. This multi-stage encoder–decoder architecture is meticulously designed to accurately predict noise residual maps, thereby preserving the locally fine details within video content as represented by SSIM. The proposed network has undergone extensive end-to-end training from scratch without explicit motion compensation to reduce complexity. In terms of the more rigorous SSIM metric, the proposed network outperformed all video denoising methods while maintaining a comparable PSNR.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint pyramidal perceptual attention and hierarchical consistency constraint for gaze estimation","authors":"","doi":"10.1016/j.cviu.2024.104105","DOIUrl":"10.1016/j.cviu.2024.104105","url":null,"abstract":"<div><p>Eye gaze provides valuable cues about human intent, making gaze estimation a hot topic. Extracting multi-scale information has recently proven effective for gaze estimation in complex scenarios. However, existing methods for estimating gaze based on multi-scale features tend to focus only on information from single-level feature maps. Furthermore, information across different scales may also lack relevance. To address these issues, we propose a novel joint pyramidal perceptual attention and hierarchical consistency constraint (PaCo) for gaze estimation. The proposed PaCo consists of two main components: pyramidal perceptual attention module (PPAM) and hierarchical consistency constraint (HCC). Specifically, PPAM first extracts multi-scale spatial features using a pyramid structure, and then aggregates information from coarse granularity to fine granularity. In this way, PPAM enables the model to simultaneously focus on both the eye region and facial region at multiple scales. Then, HCC makes constrains consistency on low-level and high-level features, aiming to enhance the gaze semantic consistency between different feature levels. With the combination of PPAM and HCC, PaCo can learn more discriminative features in complex situations. Extensive experimental results show that PaCo achieves significant performance improvements on challenging datasets such as Gaze360, MPIIFaceGaze, and RT-GENE,reducing errors to 10.27<span><math><mo>°</mo></math></span>, 3.23<span><math><mo>°</mo></math></span>, 6.46<span><math><mo>°</mo></math></span>, respectively.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UC-former: A multi-scale image deraining network using enhanced transformer","authors":"","doi":"10.1016/j.cviu.2024.104097","DOIUrl":"10.1016/j.cviu.2024.104097","url":null,"abstract":"<div><p>While convolutional neural networks (CNN) have achieved remarkable performance in single image deraining tasks, it is still a very challenging task due to CNN’s limited receptive field and the unreality of the output image. In this paper, UC-former, an effective and efficient U-shaped architecture based on transformer for image deraining was presented. In UC-former, there are two core designs to avoid heavy self-attention computation and inefficient communications across encoder and decoder. First, we propose a novel channel across Transformer block, which computes self-attention between channels. It significantly reduces the computational complexity of high-resolution rain maps while capturing global context. Second, we propose a multi-scale feature fusion module between the encoder and decoder to combine low-level local features and high-level non-local features. In addition, we employ depth-wise convolution and H-Swish non-linear activation function in Transformer Blocks to enhance rain removal authenticity. Extensive experiments indicate that our method outperforms the state-of-the-art deraining approaches on synthetic and real-world rainy datasets.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Invisible gas detection: An RGB-thermal cross attention network and a new benchmark","authors":"","doi":"10.1016/j.cviu.2024.104099","DOIUrl":"10.1016/j.cviu.2024.104099","url":null,"abstract":"<div><p>The widespread use of various chemical gases in industrial processes necessitates effective measures to prevent their leakage during transportation and storage, given their high toxicity. Thermal infrared-based computer vision detection techniques provide a straightforward approach to identify gas leakage areas. However, the development of high-quality algorithms has been challenging due to the low texture in thermal images and the lack of open-source datasets. In this paper, we present the <strong>R</strong>GB-<strong>T</strong>hermal <strong>C</strong>ross <strong>A</strong>ttention <strong>N</strong>etwork (RT-CAN), which employs an RGB-assisted two-stream network architecture to integrate texture information from RGB images and gas area information from thermal images. Additionally, to facilitate the research of invisible gas detection, we introduce Gas-DB, an extensive open-source gas detection database including about 1.3K well-annotated RGB-thermal images with eight variant collection scenes. Experimental results demonstrate that our method successfully leverages the advantages of both modalities, achieving state-of-the-art (SOTA) performance among RGB-thermal methods, surpassing single-stream SOTA models in terms of accuracy, Intersection of Union (IoU), and F2 metrics by 4.86%, 5.65%, and 4.88%, respectively. The code and data can be found at <span><span>https://github.com/logic112358/RT-CAN</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bidirectional brain image translation using transfer learning from generic pre-trained models","authors":"","doi":"10.1016/j.cviu.2024.104100","DOIUrl":"10.1016/j.cviu.2024.104100","url":null,"abstract":"<div><p>Brain imaging plays a crucial role in the diagnosis and treatment of various neurological disorders, providing valuable insights into the structure and function of the brain. Techniques such as magnetic resonance imaging (MRI) and computed tomography (CT) enable non-invasive visualization of the brain, aiding in the understanding of brain anatomy, abnormalities, and functional connectivity. However, cost and radiation dose may limit the acquisition of specific image modalities, so medical image synthesis can be used to generate required medical images without actual addition. CycleGAN and other GANs are valuable tools for generating synthetic images across various fields. In the medical domain, where obtaining labeled medical images is labor-intensive and expensive, addressing data scarcity is a major challenge. Recent studies propose using transfer learning to overcome this issue. This involves adapting pre-trained CycleGAN models, initially trained on non-medical data, to generate realistic medical images. In this work, transfer learning was applied to the task of MR-CT image translation and vice versa using 18 pre-trained non-medical models, and the models were fine-tuned to have the best result. The models’ performance was evaluated using four widely used image quality metrics: Peak-signal-to-noise-ratio, Structural Similarity Index, Universal Quality Index, and Visual Information Fidelity. Quantitative evaluation and qualitative perceptual analysis by radiologists demonstrate the potential of transfer learning in medical imaging and the effectiveness of the generic pre-trained model. The results provide compelling evidence of the model’s exceptional performance, which can be attributed to the high quality and similarity of the training images to actual human brain images. These results underscore the significance of carefully selecting appropriate and representative training images to optimize performance in brain image analysis tasks.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141936008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image semantic segmentation of indoor scenes: A survey","authors":"","doi":"10.1016/j.cviu.2024.104102","DOIUrl":"10.1016/j.cviu.2024.104102","url":null,"abstract":"<div><p>This survey provides a comprehensive evaluation of various deep learning-based segmentation architectures. It covers a wide range of models, from traditional ones like FCN and PSPNet to more modern approaches like SegFormer and FAN. In addition to assessing the methods in terms of segmentation accuracy, we propose to also evaluate the methods in terms of temporal consistency and corruption vulnerability. Most of the existing surveys on semantic segmentation focus on outdoor datasets. In contrast, this survey focuses on indoor scenarios to enhance the applicability of segmentation methods in this specific domain. Furthermore, our evaluation consists of a performance analysis of the methods in prevalent real-world segmentation scenarios that pose particular challenges. These complex situations involve scenes impacted by diverse forms of noise, blur corruptions, camera movements, optical aberrations, among other factors. By jointly exploring the segmentation accuracy, temporal consistency, and corruption vulnerability in challenging real-world situations, our survey offers insights that go beyond existing surveys, facilitating the understanding and development of better image segmentation methods for indoor scenes.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001838/pdfft?md5=2d19fe112ea2fe5f2c0ab7afa65c3059&pid=1-s2.0-S1077314224001838-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141963153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural image re-exposure","authors":"","doi":"10.1016/j.cviu.2024.104094","DOIUrl":"10.1016/j.cviu.2024.104094","url":null,"abstract":"<div><p>Images and videos often suffer from issues such as motion blur, video discontinuity, or rolling shutter artifacts. Prior studies typically focus on designing specific algorithms to address individual issues. In this paper, we highlight that these issues, albeit differently manifested, fundamentally stem from sub-optimal exposure processes. With this insight, we propose a paradigm termed re-exposure, which resolves the aforementioned issues by performing exposure simulation. Following this paradigm, we design a new architecture, which constructs visual content representation from images and event camera data, and performs exposure simulation in a controllable manner. Experiments demonstrate that, using only a single model, the proposed architecture can effectively address multiple visual issues, including motion blur, video discontinuity, and rolling shutter artifacts, even when these issues co-occur.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141842874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uni MS-PS: A multi-scale encoder-decoder transformer for universal photometric stereo","authors":"","doi":"10.1016/j.cviu.2024.104093","DOIUrl":"10.1016/j.cviu.2024.104093","url":null,"abstract":"<div><p>Photometric Stereo (PS) addresses the challenge of reconstructing a three-dimensional (3D) representation of an object by estimating the 3D normals at all points on the object’s surface. This is achieved through the analysis of at least three photographs, all taken from the same viewpoint but with distinct lighting conditions. This paper introduces a novel approach for Universal PS, i.e., when both the active lighting conditions and the ambient illumination are unknown. Our method employs a multi-scale encoder–decoder architecture based on Transformers that allows to accommodates images of any resolutions as well as varying number of input images. We are able to scale up to very high resolution images like 6000 pixels by 8000 pixels without losing performance and maintaining a decent memory footprint. Moreover, experiments on publicly available datasets establish that our proposed architecture improves the accuracy of the estimated normal field by a significant factor compared to state-of-the-art methods. Code and dataset available at: <span><span>https://clement-hardy.github.io/Uni-MS-PS/index.html</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141841792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image-to-image translation based face photo de-meshing using GANs","authors":"","doi":"10.1016/j.cviu.2024.104080","DOIUrl":"10.1016/j.cviu.2024.104080","url":null,"abstract":"<div><p>Most of the existing face photo de-meshing methods have accomplished promising results; there are certain quality problems with these methods like the inpainted regions would appear blurry and unpleasant boundaries becoming visible. Such artifacts cause generated face photos unreal. Therefore, we propose an effective image-to-image translation framework called Face De-meshing Using Generative Adversarial Networks (De-mesh GANs). The De-mesh GANs is a two-stage model: (i) binary mask generating module, is a three convolution layers-based encoder–decoder network architecture that automatically generates a binary mask for the meshed region, and (ii) face photo de-meshing module, is a GANs-based network that eliminates the mesh mask and synthesizes the meshed area. An arrangement of careful losses (reconstruction loss, adversarial loss, and perceptual loss) is used to reassure the better quality of the de-mesh face photos. To facilitate the training of the proposed model, we have designed a dataset of clean/corrupted photo pairs using the CelebA dataset. Qualitative and quantitative evaluations of the De-mesh GANs on real-world corrupted face photo images show better performance than the previously proposed face photo de-meshing models. Furthermore, we also offer the ablation study for performance assessment of the additional network i.e., perceptual network.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":null,"pages":null},"PeriodicalIF":4.3,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1077314224001619/pdfft?md5=69edb9b36e9f2ed6358c7a01f72da000&pid=1-s2.0-S1077314224001619-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141839568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}