{"title":"Open-set domain adaptation with visual-language foundation models","authors":"Qing Yu , Go Irie , Kiyoharu Aizawa","doi":"10.1016/j.cviu.2024.104230","DOIUrl":"10.1016/j.cviu.2024.104230","url":null,"abstract":"<div><div>Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104230"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142722825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Action assessment in rehabilitation: Leveraging machine learning and vision-based analysis","authors":"Alaa Kryeem , Noy Boutboul , Itai Bear , Shmuel Raz , Dana Eluz , Dorit Itah , Hagit Hel-Or , Ilan Shimshoni","doi":"10.1016/j.cviu.2024.104228","DOIUrl":"10.1016/j.cviu.2024.104228","url":null,"abstract":"<div><div>Post-hip replacement rehabilitation often depends on exercises under medical supervision. Yet, the lack of therapists, financial limits, and inconsistent evaluations call for a more user-friendly, accessible approach. Our proposed solution is a scalable, affordable system based on computer vision, leveraging machine learning and 2D cameras to provide tailored monitoring. This system is designed to address the shortcomings of conventional rehab methods, facilitating effective healthcare at home. The system’s key feature is the use of DTAN deep learning approach to synchronize exercise data over time, which guarantees precise analysis and evaluation. We also introduce a ‘Golden Feature’—a spatio-temporal element that embodies the essential movement of the exercise, serving as the foundation for aligning signals and identifying crucial exercise intervals. The system employs automated feature extraction and selection, offering valuable insights into the execution of exercises and enhancing the system’s precision. Moreover, it includes a multi-label ML model that not only predicts exercise scores but also forecasts therapists’ feedback for exercises performed partially. Performance of the proposed system is shown to be predict exercise scores with accuracy between 82% and 95%. Due to the automatic feature selection, and alignment methods, the proposed framework is easily scalable to additional exercises.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104228"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142757044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging vision-language prompts for real-world image restoration and enhancement","authors":"Yanyan Wei , Yilin Zhang , Kun Li , Fei Wang , Shengeng Tang , Zhao Zhang","doi":"10.1016/j.cviu.2024.104222","DOIUrl":"10.1016/j.cviu.2024.104222","url":null,"abstract":"<div><div>Significant advancements have been made in image restoration methods aimed at removing adverse weather effects. However, due to natural constraints, it is challenging to collect real-world datasets for adverse weather removal tasks. Consequently, existing methods predominantly rely on synthetic datasets, which struggle to generalize to real-world data, thereby limiting their practical utility. While some real-world adverse weather removal datasets have emerged, their design, which involves capturing ground truths at a different moment, inevitably introduces interfering discrepancies between the degraded images and the ground truths. These discrepancies include variations in brightness, color, contrast, and minor misalignments. Meanwhile, real-world datasets typically involve complex rather than singular degradation types. In many samples, degradation features are not overt, which poses immense challenges to real-world adverse weather removal methodologies. To tackle these issues, we introduce the recently prominent vision-language model, CLIP, to aid in the image restoration process. An expanded and fine-tuned CLIP model acts as an ‘expert’, leveraging the image priors acquired through large-scale pre-training to guide the operation of the image restoration model. Additionally, we generate a set of pseudo-ground-truths on sequences of degraded images to further alleviate the difficulty for the model in fitting the data. To imbue the model with more prior knowledge about degradation characteristics, we also incorporate additional synthetic training data. Lastly, the progressive learning and fine-tuning strategies employed during training enhance the model’s final performance, enabling our method to surpass existing approaches in both visual quality and objective image quality assessment metrics.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104222"},"PeriodicalIF":4.3,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RetSeg3D: Retention-based 3D semantic segmentation for autonomous driving","authors":"Gopi Krishna Erabati, Helder Araujo","doi":"10.1016/j.cviu.2024.104231","DOIUrl":"10.1016/j.cviu.2024.104231","url":null,"abstract":"<div><div>LiDAR semantic segmentation is one of the crucial tasks for scene understanding in autonomous driving. Recent trends suggest that voxel- or fusion-based methods obtain improved performance. However, the fusion-based methods are computationally expensive. On the other hand, the voxel-based methods uniformly employ local operators (e.g., 3D SparseConv) without considering the varying-density property of LiDAR point clouds, which result in inferior performance, specifically on far away sparse points due to limited receptive field. To tackle this issue, we propose novel retention block to capture long-range dependencies, maintain the receptive field of far away sparse points and design <strong>RetSeg3D</strong>, a retention-based 3D semantic segmentation model for autonomous driving. Instead of vanilla attention mechanism to model long-range dependencies, inspired by RetNet, we design cubic window multi-scale retentive self-attention (CW-MSRetSA) module with bidirectional and 3D explicit decay mechanism to introduce 3D spatial distance related prior information into the model to improve not only the receptive field but also the model capacity. Our novel retention block maintains the receptive field which significantly improve the performance of far away sparse points. We conduct extensive experiments and analysis on three large-scale datasets: SemanticKITTI, nuScenes and Waymo. Our method not only outperforms existing methods on far away sparse points but also on close and medium distance points and efficiently runs in real time at 52.1 FPS on a RTX 4090 GPU.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104231"},"PeriodicalIF":4.3,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SANet: Selective Aggregation Network for unsupervised object re-identification","authors":"Minghui Lin, Jianhua Tang, Longbin Fu, Zhengrong Zuo","doi":"10.1016/j.cviu.2024.104232","DOIUrl":"10.1016/j.cviu.2024.104232","url":null,"abstract":"<div><div>Recent advancements in unsupervised object re-identification have witnessed remarkable progress, which usually focuses on capturing fine-grained semantic information through partitioning or relying on auxiliary networks for optimizing label consistency. However, incorporating extra complex partitioning mechanisms and models leads to non-negligible optimization difficulties, resulting in limited performance gains. To address these problems, this paper presents a Selective Aggregation Network (SANet) to obtain high-quality features and labels for unsupervised object re-identification, which explores primitive fine-grained information of large-scale pre-trained models such as CLIP and designs customized modifications. Specifically, we propose an adaptive selective aggregation module that chooses a set of tokens based on CLIP’s attention scores to aggregate discriminative global features. Built upon the representations output by the adaptive selective aggregation module, we design a dynamic weighted clustering algorithm to obtain accurate confidence-weighted pseudo-class centers for contrastive learning. In addition, a dual confidence judgment strategy is introduced to refine and correct the pseudo-labels by assigning three categories of samples through their noise degree. By this means, the proposed SANet enables discriminative feature extraction and clustering refinement for more precise classification without complex architectures such as feature partitioning or auxiliary models. Extensive experiments on existing standard unsupervised object re-identification benchmarks, including Market1501, MSMT17, and Veri776, demonstrate the effectiveness of the proposed SANet method, and SANet achieves state-of-the-art results over other strong competitors.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104232"},"PeriodicalIF":4.3,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scene-cGAN: A GAN for underwater restoration and scene depth estimation","authors":"Salma González-Sabbagh , Antonio Robles-Kelly , Shang Gao","doi":"10.1016/j.cviu.2024.104225","DOIUrl":"10.1016/j.cviu.2024.104225","url":null,"abstract":"<div><div>Despite their wide scope of application, the development of underwater models for image restoration and scene depth estimation is not a straightforward task due to the limited size and quality of underwater datasets, as well as variations in water colours resulting from attenuation, absorption and scattering phenomena in the water column. To address these challenges, we present an all-in-one conditional generative adversarial network (cGAN) called Scene-cGAN. Our cGAN is a physics-based multi-domain model designed for image dewatering, restoration and depth estimation. It comprises three generators and one discriminator. To train our Scene-cGAN, we use a multi-term loss function based on uni-directional cycle-consistency and a novel dataset. This dataset is constructed from RGB-D in-air images using spectral data and concentrations of water constituents obtained from real-world water quality surveys. This approach allows us to produce imagery consistent with the radiance and veiling light corresponding to representative water types. Additionally, we compare Scene-cGAN with current state-of-the-art methods using various datasets. Results demonstrate its competitiveness in terms of colour restoration and its effectiveness in estimating the depth information for complex underwater scenes.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104225"},"PeriodicalIF":4.3,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"2S-SGCN: A two-stage stratified graph convolutional network model for facial landmark detection on 3D data","authors":"Jacopo Burger, Giorgio Blandano, Giuseppe Maurizio Facchi, Raffaella Lanzarotti","doi":"10.1016/j.cviu.2024.104227","DOIUrl":"10.1016/j.cviu.2024.104227","url":null,"abstract":"<div><div>Facial Landmark Detection (FLD) algorithms play a crucial role in numerous computer vision applications, particularly in tasks such as face recognition, head pose estimation, and facial expression analysis. While FLD on images has long been the focus, the emergence of 3D data has led to a surge of interest in FLD on it due to its potential applications in various fields, including medical research. However, automating FLD in this context presents significant challenges, such as selecting suitable network architectures, refining outputs for precise landmark localization and optimizing computational efficiency. In response, this paper presents a novel approach, the 2-Stage Stratified Graph Convolutional Network (<span>2S-SGCN</span>), which addresses these challenges comprehensively. The first stage aims to detect landmark regions using heatmap regression, which leverages both local and long-range dependencies through a stratified approach. In the second stage, 3D landmarks are precisely determined using a new post-processing technique, namely <span>MSE-over-mesh</span>. <span>2S-SGCN</span> ensures both efficiency and suitability for resource-constrained devices. Experimental results on 3D scans from the public Facescape and Headspace datasets, as well as on point clouds derived from FLAME meshes collected in the DAD-3DHeads dataset, demonstrate that the proposed method achieves state-of-the-art performance across various conditions. Source code is accessible at <span><span>https://github.com/gfacchi-dev/CVIU-2S-SGCN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104227"},"PeriodicalIF":4.3,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142653910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual stage semantic information based generative adversarial network for image super-resolution","authors":"Shailza Sharma , Abhinav Dhall , Shikhar Johri , Vinay Kumar , Vivek Singh","doi":"10.1016/j.cviu.2024.104226","DOIUrl":"10.1016/j.cviu.2024.104226","url":null,"abstract":"<div><div>Deep learning has revolutionized image super-resolution, yet challenges persist in preserving intricate details and avoiding overly smooth reconstructions. In this work, we introduce a novel architecture, the Residue and Semantic Feature-based Dual Subpixel Generative Adversarial Network (RSF-DSGAN), which emphasizes the critical role of semantic information in addressing these issues. The proposed generator architecture is designed with two sequential stages: the Premier Residual Stage and the Deuxième Residual Stage. These stages are concatenated to form a dual-stage upsampling process, substantially augmenting the model’s capacity for feature learning. A central innovation of our approach is the integration of semantic information directly into the generator. Specifically, feature maps derived from a pre-trained network are fused with the primary feature maps of the first stage, enriching the generator with high-level contextual cues. This semantic infusion enhances the fidelity and sharpness of reconstructed images, particularly in preserving object details and textures. Inter- and intra-residual connections are employed within these stages to maintain high-frequency details and fine textures. Additionally, spectral normalization is introduced in the discriminator to stabilize training. Comprehensive evaluations, including visual perception and mean opinion scores, demonstrate that RSF-DSGAN, with its emphasis on semantic information, outperforms current state-of-the-art super-resolution methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104226"},"PeriodicalIF":4.3,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing scene text detectors with realistic text image synthesis using diffusion models","authors":"Ling Fu , Zijie Wu , Yingying Zhu , Yuliang Liu , Xiang Bai","doi":"10.1016/j.cviu.2024.104224","DOIUrl":"10.1016/j.cviu.2024.104224","url":null,"abstract":"<div><div>Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the <strong>Diff</strong>usion Model based <strong>Text</strong> Generator (<strong>DiffText</strong>), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background’s intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images. Code is available at: <span><span>https://github.com/99Franklin/DiffText</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104224"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142654044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UUD-Fusion: An unsupervised universal image fusion approach via generative diffusion model","authors":"Xiangxiang Wang , Lixing Fang , Junli Zhao , Zhenkuan Pan , Hui Li , Yi Li","doi":"10.1016/j.cviu.2024.104218","DOIUrl":"10.1016/j.cviu.2024.104218","url":null,"abstract":"<div><div>Image fusion is a classical problem in the field of image processing whose solutions are usually not unique. The common image fusion methods can only generate a fixed fusion result based on the source image pairs. They tend to be applicable only to a specific task and have high computational costs. Hence, in this paper, a two-stage unsupervised universal image fusion with generative diffusion model is proposed, termed as UUD-Fusion. For the first stage, a strategy based on the initial fusion results is devised to offload the computational effort. For the second stage, two novel sampling algorithms based on generative diffusion model are designed. The fusion sequence generation algorithm (FSGA) searches for a series of solutions in the solution space by iterative sampling. The fusion image enhancement algorithm (FIEA) greatly improves the quality of the fused images. Qualitative and quantitative evaluations of multiple datasets with different modalities demonstrate the great versatility and effectiveness of UUD-Fusion. It is capable of solving different fusion problems, including multi-focus image fusion task, multi-exposure image fusion task, infrared and visible fusion task, and medical image fusion task. The proposed approach is superior to current state-of-the-art methods. Our code is publicly available at <span><span>https://github.com/xiangxiang-wang/UUD-Fusion</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104218"},"PeriodicalIF":4.3,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142663862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}