International Journal of Computer Vision: Latest Articles

Optimal Transport with Arbitrary Prior for Dynamic Resolution Network

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-26 DOI: 10.1007/s11263-025-02483-7

Zhizhong Zhang, Shujun Li, Chenyang Zhang, Lizhuang Ma, Xin Tan, Yuan Xie

Abstract: Dynamic resolution networks have proved crucial in reducing computational redundancy by automatically assigning a suitable resolution to each input image. However, resolution choices are often observed to collapse: prior works tend to assign images to the resolution routes whose computational cost is close to the required FLOPs. In this paper, we propose a novel optimal transport dynamic resolution network (OTD-Net) by establishing an intrinsic connection between resolution assignment and the optimal transport problem. In this framework, each sample owns a resolution assignment choice and is viewed as a supplier, and each resolution requires unallocated images and is considered a demander. With two assignment priors, OTD-Net benefits from the non-collapse division under theoretical support, and produces the desired assignment policy by balancing the computation budget and prediction accuracy. On that basis, a multi-resolution inference is proposed to ensemble low-resolution predictions. Extensive experiments, including image classification, object detection, and depth estimation, show that our approach is efficient and effective for both ResNet and Transformer, achieving state-of-the-art performance on various benchmarks.
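The supplier/demander framing in the abstract above can be illustrated with a generic entropy-regularised optimal transport solver. The sketch below is not the OTD-Net implementation: it runs plain Sinkhorn iterations on a made-up cost matrix, and the function name `sinkhorn_assign`, the cost values, and the uniform 50/50 resolution prior are all assumptions chosen for illustration. It only shows how fixing the demand on each resolution keeps every sample from being routed to the same one.

```python
import math

def sinkhorn_assign(cost, sample_mass, resolution_mass, eps=0.1, n_iters=200):
    """Generic Sinkhorn iterations (illustrative, not the paper's solver).

    cost[i][j]      : hypothetical cost of running sample i at resolution j
    sample_mass     : supply of each sample (rows of the plan sum to this)
    resolution_mass : demand of each resolution (columns sum to this)
    Returns a transport plan as a list of rows.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-cost / eps)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so each sample ships exactly its supply.
        for i in range(n):
            u[i] = sample_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
        # Rescale columns so each resolution receives exactly its demand.
        for j in range(m):
            v[j] = resolution_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Made-up costs: column 0 = low resolution, column 1 = high resolution.
cost = [
    [0.90, 0.10],  # hard sample: low resolution hurts a lot
    [0.80, 0.20],
    [0.60, 0.40],
    [0.55, 0.45],  # easy sample: low resolution is almost as good
]
plan = sinkhorn_assign(cost, sample_mass=[0.25] * 4, resolution_mass=[0.5, 0.5])
# Route each sample to the resolution that receives most of its mass.
routes = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

In this toy setup every sample is individually cheaper at the high resolution, so a per-sample argmin over the cost would send all four to column 1. The fixed column marginal forces half of the mass onto the low-resolution route, and the samples that lose least from downscaling are the ones moved there.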

Citations: 0

AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine Search

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-26 DOI: 10.1007/s11263-025-02480-w

Zhenglun Kong, Dongkuan Xu, Zhengang Li, Peiyan Dong, Hao Tang, Yanzhi Wang, Subhabrata Mukherjee

Abstract: Despite their impressive performance on various tasks, vision transformers (ViTs) are heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we address the challenge of finding optimal lightweight ViTs given constraints on model size and computational cost using neural architecture search. We use a search algorithm that considers both model parameters and on-device deployment latency. This method analyzes network properties, hardware memory access patterns, and degree of parallelism to directly and accurately estimate the network latency. To avoid extensive testing during the search process, we use a lookup table based on a detailed breakdown of the speed of each component and operation, which can be reused to evaluate the overall latency of each searched structure. Our approach leads to improved efficiency compared to testing the speed of the whole model during the search process. Extensive experiments demonstrate that, under similar parameters and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For instance, on ImageNet-1K, AutoViT_XXS (71.3% Top-1 accuracy, 10.2 ms latency) outperforms MobileViTv3_XXS (71.0% Top-1 accuracy, 12.5 ms latency) with 0.3% higher accuracy and 2.3 ms lower latency.
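The lookup-table idea described above can be sketched in a few lines. This is a simplified illustration, not AutoViT's estimator: the table entries, the block names, the 5 ms budget, and the purely additive latency model are all assumptions, whereas the abstract states that the actual method also accounts for memory access patterns and degree of parallelism.

```python
from typing import Dict, List, Tuple

# Hypothetical per-block latencies in milliseconds, keyed by (block type, width).
# In practice such a table would be measured once on the target device.
LATENCY_TABLE_MS: Dict[Tuple[str, int], float] = {
    ("conv3x3", 64): 0.30,
    ("attention", 64): 0.80,
    ("attention", 128): 1.50,
    ("mlp", 64): 0.40,
    ("mlp", 128): 0.90,
}

def estimate_latency_ms(architecture: List[Tuple[str, int]]) -> float:
    """Estimate model latency by summing the table entry of every block."""
    return sum(LATENCY_TABLE_MS[block] for block in architecture)

# A candidate architecture proposed by the search (made up for this example).
candidate = [
    ("conv3x3", 64),
    ("attention", 64),
    ("mlp", 64),
    ("attention", 128),
    ("mlp", 128),
]
latency = estimate_latency_ms(candidate)

# Reject candidates that exceed an assumed latency budget without ever
# deploying them on the device.
LATENCY_BUDGET_MS = 5.0
within_budget = latency <= LATENCY_BUDGET_MS
```

Because each candidate is scored with dictionary lookups, the search can compare many architectures without benchmarking each whole model, which is the efficiency gain the abstract refers to.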

Citations: 0

DocScanner: Robust Document Image Rectification with Progressive Learning

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-26 DOI: 10.1007/s11263-025-02431-5

Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li

Abstract: Compared with flatbed scanners, portable smartphones provide more convenience for physical document digitization. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To this end, we present DocScanner, a novel framework for document image rectification. Different from existing solutions, DocScanner addresses this issue by introducing a progressive learning mechanism. Specifically, DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements make DocScanner converge to a robust and superior rectification performance, while the lightweight recurrent architecture ensures the running efficiency. To further improve rectification quality, a geometric constraint based on the geometric prior between the distorted and the rectified images is introduced during training. Extensive experiments are conducted on the Doc3D dataset and the DocUNet Benchmark dataset, and the quantitative and qualitative evaluation results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, DocScanner shows superior efficiency in runtime latency and model size. The code and pre-trained models are available at https://github.com/fh2019ustc/DocScanner.

Citations: 0

Lightweight Structure-Aware Attention for Visual Understanding

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-26 DOI: 10.1007/s11263-025-02475-7

Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari

Abstract: The attention operator has been widely used as a basic building block in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has better representation power with log-linear complexity. Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and other downstream tasks such as video action recognition on Kinetics-400, object detection and instance segmentation on COCO, and semantic segmentation on ADE-20K.

Citations: 0

PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-25 DOI: 10.1007/s11263-025-02486-4

Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li

Abstract: With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model’s ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.

Citations: 0

Modeling Scattering Effect for Under-Display Camera Image Restoration

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-25 DOI: 10.1007/s11263-025-02454-y

Binbin Song, Jiantao Zhou, Xiangyu Chen, Shuning Xu

Abstract: The under-display camera (UDC) technology furnishes users with an uninterrupted full-screen viewing experience, eliminating the need for notches or punch holes. However, the translucent properties of the display lead to substantial degradation in UDC images. This work addresses the challenge of restoring UDC images by specifically targeting the scattering effect induced by the display. We explicitly model this scattering phenomenon by treating the display as a homogeneous scattering medium. Leveraging this physical model, the image formation pipeline is enhanced to synthesize more realistic UDC images alongside corresponding ground-truth images, thereby constructing a more accurate UDC dataset. To counteract the scattering effect in the restoration process, we propose a dual-branch network. The scattering branch employs channel-wise self-attention to estimate the scattering parameters, while the image branch capitalizes on the local feature representation capabilities of CNNs to restore the degraded UDC images. Additionally, we introduce a novel channel-wise cross-attention fusion block that integrates global scattering information into the image branch, facilitating improved restoration. To further refine the model, we design a dark channel regularization loss during training to reduce the gap between the dark channel distributions of the restored and ground-truth images. Comprehensive experiments conducted on both synthetic and real-world datasets demonstrate the superiority of our approach over current state-of-the-art UDC restoration methods. Our source code is publicly available at: https://github.com/NamecantbeNULL/SRUDC_pp.
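The dark channel regularization mentioned above can be made concrete with a small sketch. The abstract does not give the exact form of the loss, so the following is an assumption: it uses the standard dark channel definition (minimum over colour channels, then minimum over a local patch) and a mean absolute difference between the two dark channel maps. The function names `dark_channel` and `dark_channel_l1` and the toy images are hypothetical.

```python
def dark_channel(image, patch=3):
    """Dark channel of an image given as an H x W grid of (r, g, b) tuples in [0, 1].

    Each output pixel is the minimum intensity over all colour channels
    within a patch x patch window centred on it (clipped at the borders).
    """
    height, width = len(image), len(image[0])
    radius = patch // 2
    # Per-pixel minimum over the three colour channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dark[y][x] = min(
                min_rgb[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
    return dark

def dark_channel_l1(restored, target, patch=3):
    """Mean absolute difference between dark channels (assumed loss form)."""
    dark_r = dark_channel(restored, patch)
    dark_t = dark_channel(target, patch)
    height, width = len(dark_r), len(dark_r[0])
    total = sum(
        abs(dark_r[y][x] - dark_t[y][x]) for y in range(height) for x in range(width)
    )
    return total / (height * width)

# Toy 3x3 images: a clean target and a version with a uniform scattering veil.
target = [[(0.0, 0.5, 0.5)] * 3 for _ in range(3)]
hazy = [[(0.2, 0.6, 0.6)] * 3 for _ in range(3)]
loss = dark_channel_l1(hazy, target)
```

A scattering veil lifts the darkest values of an image, so the dark channel of the degraded output sits above that of the clean target. Penalising this gap pushes the restored image toward the ground-truth dark channel statistics.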

Citations: 0

Local Concept Embeddings for Analysis of Concept Distributions in Vision DNN Feature Spaces

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-24 DOI: 10.1007/s11263-025-02446-y

Georgii Mikriukov, Gesina Schwalbe, Korinna Bade

Abstract: Insights into the learned latent representations are imperative for verifying deep neural networks (DNNs) in critical computer vision (CV) tasks. Therefore, state-of-the-art supervised Concept-based eXplainable Artificial Intelligence (C-XAI) methods associate user-defined concepts like “car” each with a single vector in the DNN latent space (concept embedding vector). In the case of concept segmentation, these linearly separate activation map pixels belonging to a concept from those belonging to background. Existing methods for concept segmentation, however, fall short of capturing implicitly learned sub-concepts (e.g., the DNN might split car into “proximate car” and “distant car”), and overlap of user-defined concepts (e.g., between “bus” and “truck”). In other words, they do not capture the full distribution of concept representatives in latent space. For the first time, this work shows that these simplifications are frequently broken and that distribution information can be particularly useful for understanding DNN-learned notions of sub-concepts, concept confusion, and concept outliers. To allow exploration of learned concept distributions, we propose a novel local concept analysis framework. Instead of optimizing a single global concept vector on the complete dataset, it generates a local concept embedding (LoCE) vector for each individual sample. We use the distribution formed by LoCEs to explore the latent concept distribution by fitting Gaussian mixture models (GMMs), hierarchical clustering, and concept-level information retrieval and outlier detection. Despite its context sensitivity, our method’s concept segmentation performance is competitive with global baselines. Analysis results are obtained on three datasets and six diverse vision DNN architectures, including vision transformers (ViTs). The code is available at https://github.com/continental/localconcept-embeddings.

Citations: 0

MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-24 DOI: 10.1007/s11263-025-02464-w

Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang

Abstract: Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting the scalability, or focus on single-frame or monocular inputs, neglecting the temporal information, which is fundamental for the ultimate application, i.e., end-to-end planning. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including end-to-end planning (9% collision decrease), BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our work offers a new choice for learning representation at scale in autonomous driving. Code and models are released at https://github.com/hustvl/MIM4D.

Citations: 0

Supplementary Prompt Learning for Vision-Language Models

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-24 DOI: 10.1007/s11263-025-02451-1

Rongfei Zeng, Zhipeng Yang, Ruiyun Yu, Yonggang Zhang

Abstract: Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential performance degeneration or even failure of existing prompt learning methods. For example, the class name assigned to an image containing “Transformers” might be inaccurate because selecting a precise class name among numerous candidates is challenging. Moreover, assigning class names to some images may require specialized knowledge, resulting in indexing rather than semantic labels, e.g., Group 3 and Group 4 subtypes of medulloblastoma. To cope with the class-name missing issue, we propose a simple yet effective prompt learning approach, called Supplementary Optimization (SOp), for supplementing the missing class-related information. Specifically, SOp models the class names as learnable vectors while keeping the context fixed to learn prompts for downstream tasks. Extensive experiments across 18 public datasets demonstrate the efficacy of SOp when class names are missing. SOp can achieve performance comparable to that of the context optimization approach, even without using the prior information in the class names.

Citations: 0

RigNet++: Semantic Assisted Repetitive Image Guided Network for Depth Completion

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-05-23 DOI: 10.1007/s11263-025-02470-y

Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, Jian Yang

Abstract: Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth methods primarily focus on image guided learning frameworks. However, blurry guidance in the image and unclear structure in the depth still impede their performance. To tackle these challenges, we explore a repetitive design in our image guided network to gradually and sufficiently recover depth values. Specifically, the repetition is embodied in both the image guidance branch and the depth generation branch. In the former branch, we design a dense repetitive hourglass network (DRHN) to extract discriminative image features of complex environments, which can provide powerful contextual instruction for depth prediction. In the latter branch, we present a repetitive guidance (RG) module based on dynamic convolution, in which an efficient convolution factorization is proposed to reduce the complexity while modeling high-frequency structures progressively. Furthermore, in the semantic guidance branch, we utilize the well-known large vision model, i.e., Segment Anything (SAM), to supply RG with a semantic prior. In addition, we propose a region-aware spatial propagation network (RASPN) for further depth refinement based on the semantic prior constraint. Finally, we collect a new dataset termed TOFDC for the depth completion task, which is acquired by the time-of-flight (TOF) sensor and the color camera on smartphones. Extensive experiments demonstrate that our method achieves state-of-the-art performance on KITTI, NYUv2, Matterport3D, 3D60, VKITTI, and our TOFDC.

Citations: 0