{"title":"Diffusion Models for Image Restoration and Enhancement: A Comprehensive Survey","authors":"Xin Li, Yulin Ren, Xin Jin, Cuiling Lan, Xingrui Wang, Wenjun Zeng, Xinchao Wang, Zhibo Chen","doi":"10.1007/s11263-025-02570-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02570-9","url":null,"abstract":"<p>Image restoration (IR) has been an indispensable and challenging task in the low-level vision field, which strives to improve the subjective quality of images distorted by various forms of degradation. Recently, the diffusion model has achieved significant advancements in the visual generation of AIGC, thereby raising an intuitive question, “whether the diffusion model can boost image restoration\". To answer this, some pioneering studies attempt to integrate diffusion models into the image restoration task, resulting in superior performances than previous GAN-based methods. Despite that, a comprehensive and enlightening survey on diffusion model-based image restoration remains scarce. In this paper, we are the first to present a comprehensive review of recent diffusion model-based methods on image restoration, encompassing the learning paradigm, conditional strategy, framework design, modeling strategy, and evaluation. Concretely, we first introduce the background of the diffusion model briefly and then present two prevalent workflows that exploit diffusion models in image restoration. Subsequently, we classify and emphasize the innovative designs using diffusion models for both IR and blind/real-world IR, intending to inspire future development. To evaluate existing methods thoroughly, we summarize the commonly used dataset, implementation details, and evaluation metrics. Additionally, we present the objective comparison for open-sourced methods across three tasks, including image super-resolution, deblurring, and inpainting. Ultimately, informed by the limitations in existing works, we propose nine potential and challenging directions for the future research of diffusion model-based IR, including sampling efficiency, model compression, distortion simulation and estimation, distortion invariant learning, and framework design. The repository is released at https://github.com/lixinustc/Awesome-diffusion-model-for-image-processing/</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressing Vision Transformer from the View of Model Property in Frequency Domain","authors":"Zhenyu Wang, Xuemei Xie, Hao Luo, Tao Huang, Weisheng Dong, Kai Xiong, Yongxu Liu, Xuyang Li, Fan Wang, Guangming Shi","doi":"10.1007/s11263-025-02561-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02561-w","url":null,"abstract":"<p>Vision Transformers (ViTs) have recently demonstrated significant potential in computer vision, but their high computational costs remain a challenge. To address this limitation, various methods have been proposed to compress ViTs. Most approaches utilize spatial-domain information and adapt techniques from convolutional neural networks (CNNs) pruning to reduce channels or tokens. However, differences between ViTs and CNNs in the frequency domain make these methods vulnerable to noise in the spatial domain, potentially resulting in erroneous channel or token removal and substantial performance drops. Recent studies suggest that high-frequency signals carry limited information for ViTs, and that the self-attention mechanism functions similarly to a low-pass filter. Inspired by these insights, this paper proposes a joint compression method that leverages properties of ViTs in the frequency domain. Specifically, a metric called <i>L</i>ow-<i>F</i>requency <i>S</i>ensitivity (LFS) is used to accurately identify and compress redundant channels, while a token-merging approach, assisted by <i>L</i>ow-<i>F</i>requency <i>E</i>nergy (LFE), is introduced to reduce tokens. Through joint channel and token compression, the proposed method reduces the FLOPs of ViTs by over 50% with less than a 1% performance drop on ImageNet-1K and achieves approximately a 40% reduction in FLOPs for dense prediction tasks, including object detection and semantic segmentation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"45 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What Do Visual Models Look At? Dilated Attention for Targeted Transferable Attacks","authors":"Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang","doi":"10.1007/s11263-025-02552-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02552-x","url":null,"abstract":"<p>Attention maps illustrate what visual models look at when processing benign images. However, when confronted with adversarial perturbations, attention undergoes significant alterations. Based on this phenomenon, previous non-targeted transferable attacks manipulate adversarial examples to generate distinct attention maps, disrupting crucial features shared among models. Nevertheless, the exploration of attention in targeted transferable attacks remains unexplored. To address this gap, we analyze alterations in attention across surrogate and black-box models, empirically observing that adversarial examples receiving more relevant features for the adversarial target label exhibit higher transferability across black-box models. Motivated by these findings, we propose the Dilated Attention (DA) attack, which integrates attention maximization loss and dynamic linear augmentation to improve targeted transferability. Attention maximization loss maximizes attention maps of the target label from multiple intermediate layers to attract greater attention. Dynamic linear augmentation leverages dynamic parameters to augment inputs with a broader range of attention maps, furnishing crafted perturbations with the robustness to dilate attention across diverse attention distributions. By considering the objective function and diverse inputs, DA generates adversarial examples with highly adversarial transferability against CNNs, ViTs, and adversarially trained models. We hope DA can serve as a foundational attack, guiding future research endeavors in the domain of targeted transferable attacks. The source code is available at: https://github.com/zhipeng-wei/DialtedAttention.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OA-DET3D: Embedding Object Awareness As A General Plug-in for Multi-Camera 3D Object Detection","authors":"Xiaomeng Chu, Jiajun Deng, Jianmin Ji, Yu Zhang, Houqiang Li, Yanyong Zhang","doi":"10.1007/s11263-025-02544-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02544-x","url":null,"abstract":"<p>The recent advance in multi-camera 3D object detection is featured by bird’s-eye view (BEV) representation or object queries. However, the ill-posed transformation from image-plane view to 3D space inevitably causes feature clutter and distortion, making the objects blur into the background. To this end, we explore how to incorporate supplementary cues for differentiating objects in the transformed feature representation. Formally, we introduce OA-DET3D, a general plug-in module that improves 3D object detection by bringing object awareness into a variety of existing 3D object detection pipelines. Specifically, OA-DET3D boosts the representation of objects by leveraging object-centric depth information and foreground pseudo points. First, we use object-level supervision from the properties of each 3D bounding box to guide the network in learning the depth distribution. Next, we select foreground pixels using a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation or query feature from the baseline model with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset and Argoverse 2 dataset to validate the merits of our proposed OA-DET3D. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and comprehensive detection score. The code is available at https://github.com/cxmomo/OA-DET3D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigating Knowledge Discrepancies among Multiple Datasets for Task-agnostic Unified Face Alignment","authors":"Jiahao Xia, Min Xu, Wenjian Huang, Jianguo Zhang, Haimin Zhang, Chunxia Xiao","doi":"10.1007/s11263-025-02520-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02520-5","url":null,"abstract":"<p>Despite the similar structures of human faces, existing face alignment methods cannot learn unified knowledge from multiple datasets with different landmark annotations. The limited training samples in a single dataset commonly result in fragile robustness in this field. To mitigate knowledge discrepancies among different datasets and train a task-agnostic unified face alignment (TUFA) framework, this paper presents a strategy to unify knowledge from multiple datasets. Specifically, we calculate a mean face shape for each dataset. To explicitly align these mean shapes on an interpretable plane based on their semantics, each shape is then incorporated with a group of semantic alignment embeddings. The 2D coordinates of these aligned shapes can be viewed as the anchors of the plane. By encoding them into structure prompts and further regressing the corresponding facial landmarks using image features, a mapping from the plane to the target faces is finally established, which unifies the learning target of different datasets. Consequently, multiple datasets can be utilized to boost the generalization ability of the model. The successful mitigation of discrepancies also enhances the efficiency of knowledge transferring to a novel dataset, significantly boosts the performance of few-shot face alignment. Additionally, the interpretable plane endows TUFA with a task-agnostic characteristic, enabling it to locate landmarks unseen during training in a zero-shot manner. Extensive experiments are carried on seven benchmarks and the results demonstrate an impressive improvement in face alignment brought by knowledge discrepancies mitigation. The code is available at https://github.com/Jiahao-UTS/TUFA</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"66 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLPrompt-PSG: Vision-Language Prompting for Panoptic Scene Graph Generation","authors":"Zijian Zhou, Holger Caesar, Qijun Chen, Miaojing Shi","doi":"10.1007/s11263-025-02564-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02564-7","url":null,"abstract":"<p>Panoptic scene graph generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the <b>V</b>ision-<b>L</b>anguage <b>Prompt</b>ing (<b>VLPrompt</b>) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at https://github.com/franciszzj/VLPrompt.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"66 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation","authors":"Jingyun Wang, Guoliang Kang","doi":"10.1007/s11263-025-02566-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02566-5","url":null,"abstract":"<p>Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However, we observe that when adopting CLIP to such a pixel-level understanding task, unexpected bias (including class-preference bias and space-preference bias) occurs. Previous works don’t explicitly model the bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation task. Specifically, we design a learnable “Reference” prompt to encode class-preference bias and a projection of the positional embedding in the vision transformer to encode space-preference bias respectively. To avoid interference, two kinds of biases are firstly independently encoded into different features, <i>i.e.,</i> the Reference feature and the positional feature. Via a matrix multiplication between the Reference feature and the positional feature, a bias logit map is generated to explicitly represent two kinds of biases. Then we rectify the logits of CLIP via a simple element-wise subtraction. To make the rectified results smoother and more contextual, we design a mask decoder which takes the feature of CLIP and the rectified logits as input and outputs a rectified segmentation mask with the help of Gumbel-Softmax operation. A contrastive loss based on the masked visual features and the text features of different classes is imposed, which makes the bias modeling and rectification process meaningful and effective. Extensive experiments on various benchmarks including PASCAL VOC, PASCAL Context, ADE20K, Cityscapes, and COCO Stuff demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at: https://github.com/dogehhh/ReCLIP.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Prototype Hyperbolic Learning Guided by Class Hierarchy","authors":"Paul Berg, Léo Buecher, Björn Michele, Minh-Tan Pham, Laetitia Chapel, Nicolas Courty","doi":"10.1007/s11263-025-02571-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02571-8","url":null,"abstract":"<p>In many computer vision applications, datasets often exhibit an underlying taxonomy within the label space. To adhere to this hierarchical structure, hyperbolic spaces have emerged as an effective manifold for representation learning, thanks to their ability to encode hierarchical relationships, with little distortion, even for low-dimensional embeddings. Hyperbolic prototypical learning, where class labels are represented by prototypes, has recently demonstrated strong potential in this setting. However, existing methods generally assume a uniform distribution of prototypes, overlooking the hierarchical organization of labels that may be available for a given task. To better exploit this prior knowledge, we propose a hierarchically informed method for prototype positioning. Our approach leverages the Gromov-Wasserstein distance to align the hierarchical relationships between labels with the initial uniform spherical distribution of prototypes, leading to more structured and semantically meaningful representations. Additionally, within a deep learning framework, we propose an alternative characterization of decision boundaries using horospheres, which are level sets of the Busemann function. Geometrically, horospheres correspond to spheres tangent to the boundary of hyperbolic space at a virtual point analogous to a prototype, which makes them a compliant tool in the prototypical learning context. Accordingly, we define a new horospherical layer that can be adapted to any neural network backbone. This layer is particularly advantageous when the number of prototypes exceeds the number of classes, offering enhanced flexibility to the classifier. Through our experiments, we demonstrate that the combination of proper initialization and optimized prototype positioning significantly enhances baseline performance for image classification on hierarchical datasets. Additionally, we validate our approach in two semantic segmentation tasks, using both image and point cloud datasets, confirming its effectiveness in leveraging hierarchical label structures for improved classification performance.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144930148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Transport with Arbitrary Prior for Dynamic Resolution Network","authors":"Zhizhong Zhang, Shujun Li, Chenyang Zhang, Lizhuang Ma, Xin Tan, Yuan Xie","doi":"10.1007/s11263-025-02483-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02483-7","url":null,"abstract":"<p>Dynamic resolution network is proved to be crucial in reducing computational redundancy by automatically assigning satisfactory resolution for each input image. However, it is observed that resolution choices are often collapsed, where prior works tend to assign images to the resolution routes whose computational cost is close to the required FLOPs. In this paper, we propose a novel optimal transport dynamic resolution network (OTD-Net) by establishing an intrinsic connection between resolution assignment and optimal transport problem. In this framework, each sample owns a resolution assignment choice viewed as supplier, and each resolution requires unallocated images considered as demander. With two assignment priors, OTD-Net benefits from the non-collapse division under theoretical support, and produces the desired assignment policy by balancing the computation budget and prediction accuracy. On that basis, a multi-resolution inference is proposed to ensemble low-resolution predictions. Extensive experiments including image classification, object detection and depth estimation, show our approach is both efficient and effective for both ResNet and Transformer, achieving state-of-the-art performance on various benchmarks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"9 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144137181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine Search","authors":"Zhenglun Kong, Dongkuan Xu, Zhengang Li, Peiyan Dong, Hao Tang, Yanzhi Wang, Subhabrata Mukherjee","doi":"10.1007/s11263-025-02480-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02480-w","url":null,"abstract":"<p>Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we address the challenge of finding optimal light-weight ViTs given constraints on model size and computational cost using neural architecture search. We use a search algorithm that considers both model parameters and on-device deployment latency. This method analyzes network properties, hardware memory access pattern, and degree of parallelism to directly and accurately estimate the network latency. To prevent the need for extensive testing during the search process, we use a lookup table based on a detailed breakdown of the speed of each component and operation, which can be reused to evaluate the whole latency of each search structure. Our approach leads to improved efficiency compared to testing the speed of the whole model during the search process. Extensive experiments demonstrate that, under similar parameters and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For instance, on ImageNet-1K, AutoViT_XXS (71.3% Top-1 accuracy, 10.2ms latency) outperforms MobileViTv3_XXS (71.0% Top-1 accuracy, 12.5ms latency) with 0.3% higher accuracy and 2.3ms lower latency.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"82 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144137182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}