{"title":"Cleanness-navigated-contamination network: A unified framework for recovering regional degradation","authors":"Qianhao Yu, Naishan Zheng, Jie Huang, Feng Zhao","doi":"10.1016/j.cviu.2024.104274","DOIUrl":"10.1016/j.cviu.2024.104274","url":null,"abstract":"<div><div>Image restoration from regional degradation has long been an important and challenging task. The key to contamination removal is recovering the contents of the corrupted regions under the guidance of the non-corrupted regions. Due to their inadequate long-range modeling, CNN-based approaches cannot thoroughly exploit the information from non-corrupted regions, resulting in distorted visuals with artificial traces between different regions. To address this issue, we propose a novel Cleanness-Navigated-Contamination Network (CNCNet), a unified framework for recovering regional image contamination such as shadow, flare, and other regional degradation. Our method mainly consists of two components: a contamination-oriented adaptive normalization (COAN) module and a contamination-aware aggregation with transformer (CAAT) module based on the contamination region mask. Under the guidance of the contamination mask, the COAN module computes statistics from the non-corrupted region and adaptively applies them to the corrupted region for region-wise restoration. The CAAT module utilizes the region mask to precisely guide the restoration of each contaminated pixel by considering the highly relevant pixels from the contamination-free regions for global pixel-wise restoration. Extensive experiments on both shadow removal and flare removal tasks show that our network framework achieves superior restoration performance.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104274"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
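The COAN idea described above — borrowing statistics from the clean region to renormalize the corrupted one — can be sketched in a few lines of NumPy. This is an illustrative toy under my own assumptions, not the paper's COAN module (which is learned and feature-wise); the function name `masked_stats_transfer` is hypothetical.

```python
import numpy as np

def masked_stats_transfer(image, mask, eps=1e-6):
    """Toy statistics transfer: normalize the corrupted region (mask == 1)
    with its own mean/std, then re-scale it with the mean/std computed
    over the clean region (mask == 0)."""
    clean = image[mask == 0].astype(float)
    mu, sigma = clean.mean(), clean.std() + eps
    out = image.astype(float).copy()
    corrupt = image[mask == 1].astype(float)
    # whiten the corrupted region, then apply clean-region statistics
    out[mask == 1] = (corrupt - corrupt.mean()) / (corrupt.std() + eps) * sigma + mu
    return out
```

After the transfer, the corrupted region shares the clean region's first- and second-order statistics, which is the intuition behind region-wise restoration here.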
{"title":"Full-body virtual try-on using top and bottom garments with wearing style control","authors":"Soonchan Park , Jinah Park","doi":"10.1016/j.cviu.2024.104259","DOIUrl":"10.1016/j.cviu.2024.104259","url":null,"abstract":"<div><div>Various studies have been proposed to synthesize realistic images for image-based virtual try-on, but most of them are limited to replacing a single item on a given model, without considering wearing styles. In this paper, we address the novel problem of <em>full-body</em> virtual try-on with <em>multiple</em> garments by introducing a new benchmark dataset and an image synthesis method. Our Fashion-TB dataset provides comprehensive clothing information by mapping fashion models to their corresponding top and bottom garments, along with semantic region annotations that represent the structure of the garments. WGF-VITON, the single-stage network we have developed, generates full-body try-on images using top and bottom garments simultaneously. Instead of relying on preceding networks to estimate intermediate knowledge, modules for garment transformation and image synthesis are integrated and trained through end-to-end learning. Furthermore, our method introduces a Wearing-guide scheme to control the wearing styles in the synthesized try-on images. Through various experiments on the full-body virtual try-on task, WGF-VITON outperforms state-of-the-art networks in both quantitative and qualitative evaluations with an optimized number of parameters, while allowing users to control the wearing styles of the output images. The code and data are available at <span><span>https://github.com/soonchanpark/WGF-VITON</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104259"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DM-Align: Leveraging the power of natural language instructions to make changes to images","authors":"Maria-Mihaela Trusca , Tinne Tuytelaars , Marie-Francine Moens","doi":"10.1016/j.cviu.2025.104292","DOIUrl":"10.1016/j.cviu.2025.104292","url":null,"abstract":"<div><div>Text-based semantic image editing involves manipulating an image according to a natural language instruction. Although recent works are capable of generating creative and high-quality images, the problem is still mostly approached as a black box that is prone to generating unexpected outputs. Therefore, we propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, together with the input image. The proposed Diffusion Masking with word Alignments (DM-Align) allows the editing of an image in a transparent and explainable way. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream. Compared with state-of-the-art baselines, quantitative and qualitative results show that DM-Align achieves superior performance in image editing conditioned on language instructions, preserves the image background well, and copes better with long text instructions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104292"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rebalanced supervised contrastive learning with prototypes for long-tailed visual recognition","authors":"Xuhui Chang, Junhai Zhai, Shaoxin Qiu, Zhengrong Sun","doi":"10.1016/j.cviu.2025.104291","DOIUrl":"10.1016/j.cviu.2025.104291","url":null,"abstract":"<div><div>In the real world, data often follows a long-tailed distribution, resulting in head classes receiving more attention while tail classes are frequently overlooked. Although supervised contrastive learning (SCL) performs well on balanced datasets, it struggles to distinguish features between tail classes in the latent space when dealing with long-tailed data. To address this issue, we propose Rebalanced Supervised Contrastive Learning (ReCL), which can effectively enhance the separability of tail-class features. Compared with two state-of-the-art methods, Contrastive Learning based hybrid networks (Hybrid-SC) and Targeted Supervised Contrastive Learning (TSC), ReCL has two distinctive characteristics: (1) ReCL enhances the clarity of classification boundaries between tail classes by encouraging samples to align more closely with their corresponding prototypes. (2) ReCL does not require target generation, thereby conserving computational resources. Our method significantly improves the recognition of tail classes, demonstrating competitive accuracy across multiple long-tailed datasets. Our code is available at <span><span>https://github.com/cxh981110/ReCL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104291"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
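The prototype-alignment idea in this abstract — pulling each sample toward its class prototype — can be illustrated with a softmax over feature–prototype similarities. This is a generic sketch under my own assumptions, not the authors' ReCL loss; the name `prototype_alignment_loss` and the temperature value are hypothetical.

```python
import numpy as np

def prototype_alignment_loss(features, labels, prototypes, tau=0.1):
    """Toy prototype-alignment loss: cross-entropy over cosine similarities
    between L2-normalized features and class prototypes."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = f @ p.T / tau                       # (N, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Minimizing such a loss pushes each feature toward its own prototype and away from the others, which is one way to sharpen boundaries between tail classes.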
{"title":"Graph-based Dense Event Grounding with relative positional encoding","authors":"Jianxiang Dong, Zhaozheng Yin","doi":"10.1016/j.cviu.2024.104257","DOIUrl":"10.1016/j.cviu.2024.104257","url":null,"abstract":"<div><div>Temporal Sentence Grounding (TSG) in videos aims to localize a temporal moment from an untrimmed video that is relevant to a given query sentence. Most existing methods focus on addressing the problem of single sentence grounding. Recently, researchers proposed a new Dense Event Grounding (DEG) problem by extending single event localization to multi-event localization, where the temporal moments of multiple events described by multiple sentences are retrieved. In this paper, we introduce an effective proposal-based approach to solve the DEG problem. A Relative Sentence Interaction (RSI) module using a graph neural network is proposed to model the event relationship, introducing a temporal relative positional encoding to learn the relative temporal order between sentences in a dense multi-sentence query. In addition, we design an Event-contextualized Cross-modal Interaction (ECI) module to tackle the lack of global information from other related events when fusing visual and sentence features. Finally, we construct an Event Graph (EG) with intra-event edges and inter-event edges to model the relationships between proposals in the same event and proposals in different events, further refining their representations for the final localizations. Extensive experiments on the ActivityNet-Captions and TACoS datasets show the effectiveness of our solution.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104257"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
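A relative temporal positional encoding between query sentences can be illustrated with a standard sinusoidal encoding of clipped order offsets i − j. This is a generic sketch, not the paper's exact RSI formulation; the function name and the clipping distance are assumptions.

```python
import numpy as np

def relative_position_encoding(num_sentences, dim, max_dist=4):
    """Sinusoidal encoding of the clipped relative order i - j between
    sentences in a dense multi-sentence query. dim must be even."""
    idx = np.arange(num_sentences)
    rel = np.clip(idx[:, None] - idx[None, :], -max_dist, max_dist)  # (S, S)
    freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))             # (dim/2,)
    ang = rel[..., None] * freq                                      # (S, S, dim/2)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)       # (S, S, dim)
```

The sine half is antisymmetric in i and j, so the encoding distinguishes "sentence i precedes j" from the reverse, which is the order information the RSI module is described as learning.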
{"title":"Pruning networks at once via nuclear norm-based regularization and bi-level optimization","authors":"Donghyeon Lee , Eunho Lee , Jaehyuk Kang, Youngbae Hwang","doi":"10.1016/j.cviu.2024.104247","DOIUrl":"10.1016/j.cviu.2024.104247","url":null,"abstract":"<div><div>Most network pruning methods focus on identifying redundant channels from pre-trained models, which is inefficient due to its three-step process: pre-training, pruning and fine-tuning, and reconfiguration. In this paper, we propose a pruning-from-scratch framework that unifies these processes into a single approach. We introduce nuclear norm-based regularization to maintain the representational capacity of large networks during pruning. Combining this with MACs-based regularization enhances the performance of the pruned network at the target compression rate. Our bi-level optimization approach simultaneously improves pruning efficiency and representation capacity. Experimental results show that our method achieves 75.4% accuracy on ImageNet without a pre-trained network, using only 41% of the original model’s computational cost. It also attains 0.5% higher performance in compressing the SSD network for object detection. Furthermore, we analyze the effects of nuclear norm-based regularization.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104247"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
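The nuclear norm used as a regularizer in the abstract above is simply the sum of a weight matrix's singular values. A minimal NumPy helper shows the quantity itself; how the paper combines it with MACs-based regularization and bi-level optimization is not reproduced here.

```python
import numpy as np

def nuclear_norm(weight):
    """Nuclear norm of a 2-D weight matrix: the sum of its singular
    values, a convex surrogate often used in rank-aware regularization."""
    return np.linalg.svd(weight, compute_uv=False).sum()
```

For a diagonal matrix the singular values are just the absolute diagonal entries, so `nuclear_norm(np.diag([3.0, 2.0]))` equals 5.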
{"title":"Semantic-preserved point-based human avatar","authors":"Lixiang Lin, Jianke Zhu","doi":"10.1016/j.cviu.2025.104307","DOIUrl":"10.1016/j.cviu.2025.104307","url":null,"abstract":"<div><div>To enable realistic experiences in AR/VR and digital entertainment, we present the first point-based human avatar model that embodies the entire expressive range of digital humans. Specifically, we employ two MLPs to model pose-dependent deformation and linear blend skinning (LBS) weights. The representation of appearance relies on a decoder and the features attached to each point. In contrast to alternative implicit approaches, the oriented-point representation not only provides a more intuitive way to model human avatar animation but also significantly reduces the computational time for both training and inference. Moreover, we propose a novel method to transfer semantic information from the SMPL-X model to the points, which enables a better understanding of human body movements. By leveraging the semantic information of points, we can facilitate virtual try-on and human avatar composition by exchanging points of the same category across different subjects. Experimental results demonstrate the efficacy of our presented method. Our implementation is publicly available at <span><span>https://github.com/l1346792580123/spa</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104307"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
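The classic linear blend skinning formulation that the learned LBS weights above feed into can be sketched directly: each point is deformed by a convex blend of per-bone rigid transforms. This is a generic LBS implementation, not the paper's code.

```python
import numpy as np

def linear_blend_skinning(points, weights, bone_transforms):
    """Classic LBS. points: (N, 3); weights: (N, B), rows summing to 1;
    bone_transforms: (B, 4, 4) homogeneous rigid transforms."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    # per-point blended transform, then apply it to the homogeneous point
    blended = np.einsum('nb,bij->nij', weights, bone_transforms)        # (N, 4, 4)
    deformed = np.einsum('nij,nj->ni', blended, homo)
    return deformed[:, :3]
```

With identity bone transforms the points are unchanged; with one translated bone, a point weighted 50/50 between the two bones moves half the translation.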
{"title":"Adversarial intensity awareness for robust object detection","authors":"Jikang Cheng, Baojin Huang, Yan Fang, Zhen Han, Zhongyuan Wang","doi":"10.1016/j.cviu.2024.104252","DOIUrl":"10.1016/j.cviu.2024.104252","url":null,"abstract":"<div><div>Like other computer vision models, object detectors are vulnerable to adversarial examples (AEs) containing imperceptible perturbations. These AEs can be generated with multiple intensities and then used to attack object detectors in real-world scenarios. One of the most effective ways to improve the robustness of object detectors is adversarial training (AT), which incorporates AEs into the training process. However, while previous AT-based models have shown certain robustness against adversarial attacks of a pre-specified intensity, they still struggle to maintain robustness when defending against adversarial attacks with multiple intensities. To address this issue, we propose a novel robust object detection method based on adversarial intensity awareness. We first explore potential schemes to define the relationship between the neglected intensity information and actual evaluation metrics in AT. Then, we propose the sequential intensity loss (SI Loss) to represent and leverage the neglected intensity information in the AEs. Specifically, SI Loss deploys a sequential adaptive strategy to transform intensity into concrete learnable metrics in a discrete and cumulative manner. Additionally, a boundary smoothing algorithm is introduced to mitigate the influence of particular AEs that are difficult to assign to a specific intensity level. Extensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate the superior performance of our method over other defense methods against multi-intensity adversarial attacks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104252"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint Generating Terminal Correction Imaging method for modular LED integral imaging systems","authors":"Tianshu Li, Shigang Wang","doi":"10.1016/j.cviu.2025.104279","DOIUrl":"10.1016/j.cviu.2025.104279","url":null,"abstract":"<div><div>Integral imaging has garnered significant attention in 3D display technology due to its potential for high-quality visualization. However, elemental images in integral imaging systems usually suffer from misalignment due to mechanical or human-induced assembly errors within the lens arrays, leading to undesirable display quality. This paper introduces a novel Joint-Generating Terminal Correction Imaging (JGTCI) approach tailored for large-scale, modular LED integral imaging systems to address the misalignment between the optical centers of physical lens arrays and the camera in generated elemental image arrays. Specifically, we propose: (1) a high-sensitivity calibration marker to enhance alignment precision by accurately matching lens centers to the central points of elemental images; (2) a partitioned calibration strategy that supports independent calibration of display sections, enabling seamless system expansion without recalibrating previously adjusted regions; and (3) a calibration setup in which markers are strategically placed near the lens focal length, ensuring optimal pixel coverage in the camera frame for improved accuracy. Extensive experimental results demonstrate that our JGTCI approach significantly enhances 3D display accuracy, extends the viewing angle, and improves the scalability and practicality of modular integral imaging systems, outperforming recent state-of-the-art methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"252 ","pages":"Article 104279"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143101031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}