International Journal of Computer Vision: Latest Articles

DocScanner: Robust Document Image Rectification with Progressive Learning
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-26, DOI: 10.1007/s11263-025-02431-5
Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li
{"title":"DocScanner: Robust Document Image Rectification with Progressive Learning","authors":"Hao Feng, Wengang Zhou, Jiajun Deng, Qi Tian, Houqiang Li","doi":"10.1007/s11263-025-02431-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02431-5","url":null,"abstract":"<p>Compared with flatbed scanners, portable smartphones provide more convenience for physical document digitization. However, such digitized documents are often distorted due to uncontrolled physical deformations, camera positions, and illumination variations. To this end, we present DocScanner, a novel framework for document image rectification. Different from existing solutions, DocScanner addresses this issue by introducing a progressive learning mechanism. Specifically, DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements make DocScanner converge to a robust and superior rectification performance, while the lightweight recurrent architecture ensures the running efficiency. To further improve the rectification quality, based on the geometric priori between the distorted and the rectified images, a geometric constraint is introduced during training to further improve the performance. Extensive experiments are conducted on the Doc3D dataset and the DocUNet Benchmark dataset, and the quantitative and qualitative evaluation results verify the effectiveness of DocScanner, which outperforms previous methods on OCR accuracy, image similarity, and our proposed distortion metric by a considerable margin. Furthermore, our DocScanner shows superior efficiency in runtime latency and model size. The codes and pre-trained models are available at https://github.com/fh2019ustc/DocScanner.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"40 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144137180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
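
The abstract's central idea is a recurrent loop that keeps a single rectification estimate and corrects it step by step. The sketch below illustrates that progressive-correction pattern in PyTorch; the `ProgressiveRectifier` module, its layer sizes, the 2-channel warp-field representation, and the iteration count are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of progressive correction: keep one running estimate
# (assumed here to be a 2-channel warp field) and refine it recurrently.
import torch
import torch.nn as nn

class ProgressiveRectifier(nn.Module):
    def __init__(self, feat_dim=32, iters=6):
        super().__init__()
        self.iters = iters
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU())
        self.update = nn.Sequential(
            nn.Conv2d(feat_dim + 2, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 2, 3, padding=1))

    def forward(self, distorted):                     # distorted: (B, 3, H, W)
        feat = self.encoder(distorted)
        b, _, h, w = feat.shape
        flow = torch.zeros(b, 2, h, w, device=distorted.device)  # single estimate
        preds = []
        for _ in range(self.iters):                   # progressive correction
            flow = flow + self.update(torch.cat([feat, flow], dim=1))
            preds.append(flow)
        return preds  # every iteration can be supervised during training

flows = ProgressiveRectifier()(torch.rand(1, 3, 256, 256))
```
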
Lightweight Structure-Aware Attention for Visual Understanding
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-26, DOI: 10.1007/s11263-025-02475-7
Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari
{"title":"Lightweight Structure-Aware Attention for Visual Understanding","authors":"Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari","doi":"10.1007/s11263-025-02475-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02475-7","url":null,"abstract":"<p>Attention operator has been widely used as a basic brick in visual understanding since it provides some flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) the complexity in computation and memory is quadratic in the sequence length. In this paper, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which has a better representation power with log-linear complexity. Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and other downstream tasks such as video action recognition on Kinetics-400, object detection &amp; instance segmentation on COCO, and semantic segmentation on ADE-20K.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"47 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144137183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
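
To make the "RPEs as multiplicative weights" idea concrete, here is an illustrative quadratic-complexity attention layer in which a learned per-offset weight rescales the attention kernel. The class `MultiplicativeRPEAttention`, the softplus reparameterization, and the renormalization step are assumptions for this sketch and do not reproduce the paper's log-linear approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeRPEAttention(nn.Module):
    """Self-attention whose kernel is rescaled by learned relative-position
    weights (illustrative; the actual LiSA approximates this at log-linear cost)."""
    def __init__(self, dim, seq_len, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.rpe = nn.Parameter(torch.zeros(heads, 2 * seq_len - 1))  # one weight per offset
        idx = torch.arange(seq_len)
        self.register_buffer("rel_idx", idx[None, :] - idx[:, None] + seq_len - 1)

    def forward(self, x):                              # x: (B, N, C)
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.heads, c // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        attn = attn * F.softplus(self.rpe)[:, self.rel_idx]   # multiplicative RPE weighting
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

y = MultiplicativeRPEAttention(dim=64, seq_len=49)(torch.rand(2, 49, 64))
```
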
PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-25, DOI: 10.1007/s11263-025-02486-4
Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
{"title":"PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection","authors":"Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li","doi":"10.1007/s11263-025-02486-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02486-4","url":null,"abstract":"<p>With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model’s ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"157 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144133671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
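
As a hedged sketch of what a scale consistency term between the original and resized views could look like: predicted scales from the resized view are mapped back by the known resize ratio and compared to the original view's predictions. The function name, the ratio formulation, and the smooth-L1 choice are assumptions, not the paper's SSC loss.

```python
import torch
import torch.nn.functional as F

def scale_consistency_loss(scale_orig, scale_resized, resize_ratio):
    """scale_*: (N,) predicted object scales for matched instances in the two
    views; the resized view's prediction is mapped back by the known ratio
    before comparison (assumed formulation)."""
    return F.smooth_l1_loss(scale_orig, scale_resized / resize_ratio)

loss = scale_consistency_loss(torch.rand(8) + 1.0, (torch.rand(8) + 1.0) * 0.5, 0.5)
```
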
Modeling Scattering Effect for Under-Display Camera Image Restoration
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-25, DOI: 10.1007/s11263-025-02454-y
Binbin Song, Jiantao Zhou, Xiangyu Chen, Shuning Xu
{"title":"Modeling Scattering Effect for Under-Display Camera Image Restoration","authors":"Binbin Song, Jiantao Zhou, Xiangyu Chen, Shuning Xu","doi":"10.1007/s11263-025-02454-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02454-y","url":null,"abstract":"<p>The under-display camera (UDC) technology furnishes users with an uninterrupted full-screen viewing experience, eliminating the need for notches or punch holes. However, the translucent properties of the display lead to substantial degradation in UDC images. This work addresses the challenge of restoring UDC images by specifically targeting the scattering effect induced by the display. We explicitly model this scattering phenomenon by treating the display as a homogeneous scattering medium. Leveraging this physical model, the image formation pipeline is enhanced to synthesize more realistic UDC images alongside corresponding ground-truth images, thereby constructing a more accurate UDC dataset. To counteract the scattering effect in the restoration process, we propose a dual-branch network. The scattering branch employs channel-wise self-attention to estimate the scattering parameters, while the image branch capitalizes on the local feature representation capabilities of CNNs to restore the degraded UDC images. Additionally, we introduce a novel channel-wise cross-attention fusion block that integrates global scattering information into the image branch, facilitating improved restoration. To further refine the model, we design a dark channel regularization loss during training to reduce the gap between the dark channel distributions of the restored and ground-truth images. Comprehensive experiments conducted on both synthetic and real-world datasets demonstrate the superiority of our approach over current state-of-the-art UDC restoration methods. Our source code is publicly available at: https://github.com/NamecantbeNULL/SRUDC_pp.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144133662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
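
The dark channel regularization mentioned at the end of the abstract can be illustrated with the classic dark channel operator (per-pixel RGB minimum followed by a local minimum filter) and an L1 penalty between the dark channels of the restored and ground-truth images. The patch size and the exact loss form are assumptions, not necessarily the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dark_channel(img, patch=15):
    """img: (B, 3, H, W) in [0, 1]. Per-pixel min over RGB followed by a local
    min filter (written as a negated max-pool)."""
    min_rgb = img.min(dim=1, keepdim=True).values
    return -F.max_pool2d(-min_rgb, kernel_size=patch, stride=1, padding=patch // 2)

def dark_channel_loss(restored, target, patch=15):
    # penalize the gap between dark channels of the output and the ground truth
    return F.l1_loss(dark_channel(restored, patch), dark_channel(target, patch))

loss = dark_channel_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```
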
Local Concept Embeddings for Analysis of Concept Distributions in Vision DNN Feature Spaces
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-24, DOI: 10.1007/s11263-025-02446-y
Georgii Mikriukov, Gesina Schwalbe, Korinna Bade
{"title":"Local Concept Embeddings for Analysis of Concept Distributions in Vision DNN Feature Spaces","authors":"Georgii Mikriukov, Gesina Schwalbe, Korinna Bade","doi":"10.1007/s11263-025-02446-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02446-y","url":null,"abstract":"<p>Insights into the learned latent representations are imperative for verifying deep neural networks (DNNs) in critical computer vision (CV) tasks. Therefore, state-of-the-art supervised Concept-based eXplainable Artificial Intelligence (C-XAI) methods associate user-defined concepts like “car” each with a single vector in the DNN latent space (concept embedding vector). In the case of concept segmentation, these linearly separate between activation map pixels belonging to a concept and those belonging to background. Existing methods for concept segmentation, however, fall short of capturing implicitly learned sub-concepts (e.g., the DNN might split car into “proximate car” and “distant car”), and overlap of user-defined concepts (e.g., between “bus” and “truck”). In other words, they do not capture the full distribution of concept representatives in latent space. For the first time, this work shows that these simplifications are frequently broken and that distribution information can be particularly useful for understanding DNN-learned notions of sub-concepts, concept confusion, and concept outliers. To allow exploration of learned concept distributions, we propose a novel local concept analysis framework. Instead of optimizing a single global concept vector on the complete dataset, it generates a local concept embedding (LoCE) vector for each individual sample. We use the distribution formed by LoCEs to explore the latent concept distribution by fitting Gaussian mixture models (GMMs), hierarchical clustering, and concept-level information retrieval and outlier detection. Despite its context sensitivity, our method’s concept segmentation performance is competitive to global baselines. Analysis results are obtained on three datasets and six diverse vision DNN architectures, including vision transformers (ViTs). The code is available at https://github.com/continental/localconcept-embeddings.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"15 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
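
The distribution analysis described above (GMM fitting, sub-concept discovery, and outlier detection over per-sample LoCE vectors) can be sketched with scikit-learn. The component count, the 2% outlier threshold, and the random stand-in data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
loces = rng.normal(size=(500, 64))        # stand-in for per-sample LoCE vectors

gmm = GaussianMixture(n_components=3, covariance_type="diag", random_state=0)
gmm.fit(loces)

sub_concept = gmm.predict(loces)          # mixture component ~ candidate sub-concept
log_lik = gmm.score_samples(loces)        # per-sample log-likelihood
outliers = np.where(log_lik < np.percentile(log_lik, 2))[0]   # bottom 2% flagged
print(np.bincount(sub_concept), len(outliers))
```
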
MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-24, DOI: 10.1007/s11263-025-02464-w
Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
{"title":"MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning","authors":"Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang","doi":"10.1007/s11263-025-02464-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02464-w","url":null,"abstract":"<p>Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting the scalability, or focus on single-frame or monocular inputs, neglecting the temporal information, which is fundamental for the ultimate application, <i>i</i>.<i>e</i>., end-to-end planning. We propose <span>MIM4D</span>, a novel pre-training paradigm based on dual masked image modeling (MIM). <span>MIM4D</span> leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto 2D plane for supervision. To address the lack of dense 3D supervision, <span>MIM4D</span> reconstruct pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that <span>MIM4D</span> achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including end-to-end planning(<span>(9%)</span> collision decrease), BEV segmentation (<span>(8.7%)</span> IoU), 3D object detection (<span>(3.5%)</span> mAP), and HD map construction (<span>(1.4%)</span> mAP). Our work offers a new choice for learning representation at scale in autonomous driving. Code and models are released at https://github.com/hustvl/MIM4D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
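
As a rough illustration of the masked multi-view video input that drives this kind of pre-training, the helper below zeroes out random spatial patches per view and frame. The tensor layout, patch size, and mask ratio are assumptions, and none of MIM4D's scene-flow or volumetric-rendering machinery is shown.

```python
import torch

def mask_multiview_video(x, patch=16, mask_ratio=0.75):
    """x: (B, V, T, C, H, W) multi-view clip with H, W divisible by `patch`.
    Returns the masked clip and the boolean patch mask (True = masked)."""
    b, v, t, c, h, w = x.shape
    mask = torch.rand(b, v, t, h // patch, w // patch, device=x.device) < mask_ratio
    mask_px = mask.repeat_interleave(patch, -1).repeat_interleave(patch, -2)
    return x * (~mask_px).unsqueeze(3), mask   # broadcast keep-mask over channels

clip = torch.rand(1, 6, 4, 3, 224, 224)        # e.g. 6 cameras, 4 frames
masked, mask = mask_multiview_video(clip)
```
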
Supplementary Prompt Learning for Vision-Language Models
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-24, DOI: 10.1007/s11263-025-02451-1
Rongfei Zeng, Zhipeng Yang, Ruiyun Yu, Yonggang Zhang
{"title":"Supplementary Prompt Learning for Vision-Language Models","authors":"Rongfei Zeng, Zhipeng Yang, Ruiyun Yu, Yonggang Zhang","doi":"10.1007/s11263-025-02451-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02451-1","url":null,"abstract":"<p>Pre-trained vision-language models like CLIP have shown remarkable capabilities across various downstream tasks with well-tuned prompts. Advanced methods tune prompts by optimizing context while keeping the class name fixed, implicitly assuming that the class names in prompts are accurate and not missing. However, this assumption may be violated in numerous real-world scenarios, leading to potential performance degeneration or even failure of existing prompt learning methods. For example, an accurate class name for an image containing “Transformers” might be inaccurate because selecting a precise class name among numerous candidates is challenging. Moreover, assigning class names to some images may require specialized knowledge, resulting in indexing rather than semantic labels, e.g., Group 3 and Group 4 subtypes of medulloblastoma. To cope with the class-name missing issue, we propose a simple yet effective prompt learning approach, called Supplementary Optimization (SOp) for supplementing the missing class-related information. Specifically, SOp models the class names as learnable vectors while keeping the context fixed to learn prompts for downstream tasks. Extensive experiments across 18 public datasets demonstrate the efficacy of SOp when class names are missing. SOp can achieve performance comparable to that of the context optimization approach, even without using the prior information in the class names.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"45 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144130359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
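
Below is a minimal sketch of the stated reversal of context optimization: the context embedding stays frozen while the class-name tokens are learned. The class `SupplementaryPrompt`, the token count, the embedding width, and the placeholder `frozen_text_encoder` are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SupplementaryPrompt(nn.Module):
    def __init__(self, n_classes, ctx_embed, n_cls_tokens=4, dim=512):
        """ctx_embed: (L_ctx, dim) frozen embedding of a hand-written context,
        e.g. 'a photo of a'. The class-name tokens are the learnable part."""
        super().__init__()
        self.register_buffer("ctx", ctx_embed)                 # fixed context
        self.cls_tokens = nn.Parameter(torch.randn(n_classes, n_cls_tokens, dim) * 0.02)

    def forward(self):
        # one prompt per class: [fixed context | learned class-name tokens]
        ctx = self.ctx.unsqueeze(0).expand(self.cls_tokens.size(0), -1, -1)
        return torch.cat([ctx, self.cls_tokens], dim=1)        # (n_classes, L, dim)

prompts = SupplementaryPrompt(n_classes=10, ctx_embed=torch.randn(4, 512))()
# text_feats = frozen_text_encoder(prompts)   # hypothetical CLIP-style text tower
# logits = image_feats @ text_feats.t()       # contrast with image features as usual
```
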
RigNet++: Semantic Assisted Repetitive Image Guided Network for Depth Completion
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-23, DOI: 10.1007/s11263-025-02470-y
Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, Jian Yang
{"title":"RigNet++: Semantic Assisted Repetitive Image Guided Network for Depth Completion","authors":"Zhiqiang Yan, Xiang Li, Le Hui, Zhenyu Zhang, Jun Li, Jian Yang","doi":"10.1007/s11263-025-02470-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02470-y","url":null,"abstract":"<p>Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth methods primarily focus on image guided learning frameworks. However, <i>blurry guidance in the image</i> and <i>unclear structure in the depth</i> still impede their performance. To tackle these challenges, we explore a repetitive design in our image guided network to gradually and sufficiently recover depth values. Specifically, the repetition is embodied in both the image guidance branch and depth generation branch. In the former branch, we design a dense repetitive hourglass network (DRHN) to extract discriminative image features of complex environments, which can provide powerful contextual instruction for depth prediction. In the latter branch, we present a repetitive guidance (RG) module based on dynamic convolution, in which an efficient convolution factorization is proposed to reduce the complexity while modeling high-frequency structures progressively. Furthermore, in the semantic guidance branch, we utilize the well-known large vision model, <i>i.e.</i>, segment anything (SAM), to supply RG with semantic prior. In addition, we propose a region-aware spatial propagation network (RASPN) for further depth refinement based on the semantic prior constraint. Finally, we collect a new dataset termed TOFDC for the depth completion task, which is acquired by the time-of-flight (TOF) sensor and the color camera on smartphones. Extensive experiments demonstrate that our method achieves state-of-the-art performance on KITTI, NYUv2, Matterport3D, 3D60, VKITTI, and our TOFDC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144123021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
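
To give a flavor of the repetitive guidance idea (image features repeatedly injected into the depth branch), here is a heavily simplified sketch. The plain convolutions, the residual update, and the repeat count stand in for RigNet++'s factorized dynamic convolution and are purely illustrative.

```python
import torch
import torch.nn as nn

class RepetitiveGuidance(nn.Module):
    def __init__(self, dim=32, repeats=3):
        super().__init__()
        self.repeats = repeats
        self.img_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.depth_enc = nn.Conv2d(1, dim, 3, padding=1)
        self.guide = nn.Conv2d(2 * dim, dim, 3, padding=1)
        self.head = nn.Conv2d(dim, 1, 3, padding=1)

    def forward(self, rgb, sparse_depth):
        g = torch.relu(self.img_enc(rgb))                 # image guidance features
        d = torch.relu(self.depth_enc(sparse_depth))      # depth branch features
        for _ in range(self.repeats):                     # repeated guidance injection
            d = d + torch.relu(self.guide(torch.cat([g, d], dim=1)))
        return self.head(d)                               # dense depth prediction

pred = RepetitiveGuidance()(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
```
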
A Generalized Contour Vibration Model for Building Extraction
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-22, DOI: 10.1007/s11263-025-02468-6
Chunyan Xu, Shuaizhen Yao, Ziqiang Xu, Zhen Cui, Jian Yang
{"title":"A Generalized Contour Vibration Model for Building Extraction","authors":"Chunyan Xu, Shuaizhen Yao, Ziqiang Xu, Zhen Cui, Jian Yang","doi":"10.1007/s11263-025-02468-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02468-6","url":null,"abstract":"<p>Classic active contour models (ACMs) are becoming a great promising solution to the contour-based object extraction with the progress of deep learning recently. Inspired by the wave vibration theory in physics, we propose a Generalized Contour Vibration Model (G-CVM) by inheriting the force and motion principle of contour wave for automatically estimating building contours. The contour estimation problems, conventionally solved by snake and level-set based ACMs, are unified to formulate as second-order partial differential equation to model the contour evolution. In parallel with the current ACM methods, we propose two types of evolution paradigms: curve-CVM and surface-CVM, from the perspective of the vibration spaces of contour waves. To tailor personalization contours for specific targets, we parameterize the constant coefficient wave differential equation through a convolutional network, and hereby integrate them into a unified learnable model for contour extraction. Through adopting finite difference optimization, we can progressively perform the contour evolution from an initial state through a recursive computation on the contour vibration model. Both the building contour evolution and the model optimization are modulated to form a close-looping end-to-end network. Besides, we make a discussion of ours <i>vs</i> the conventional ACMs, all which can be interpreted uniformly from the view of differential equation in different evolution domains. Comprehensive evaluations on several building datasets demonstrate the effectiveness and superiority of our proposed G-CVM when compared with other state-of-the-art building extraction networks and deep active contour solutions.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144113865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
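
The second-order, finite-difference evolution can be pictured with a toy damped wave update on a closed contour: the next position depends on the two previous states plus internal and external forces. The coefficients, the damping term, and the radial toy force below are placeholders, not the network-predicted terms of G-CVM.

```python
import numpy as np

def evolve_contour(points, force_fn, steps=100, dt=0.1, tension=1.0, damping=0.1):
    """points: (N, 2) closed contour vertices. force_fn(points) -> (N, 2) external
    force pulling the contour toward image evidence (e.g. building edges)."""
    prev, curr = points.copy(), points.copy()
    for _ in range(steps):
        # discrete Laplacian along the closed curve (internal "string" force)
        lap = np.roll(curr, 1, axis=0) - 2 * curr + np.roll(curr, -1, axis=0)
        accel = tension * lap + force_fn(curr) - damping * (curr - prev)
        nxt = 2 * curr - prev + dt ** 2 * accel   # second-order (wave-like) update
        prev, curr = curr, nxt
    return curr

# usage: shrink a circle toward the origin with a toy radial force
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1) * 10
result = evolve_contour(circle, lambda p: -0.05 * p)
```
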
Simplified Concrete Dropout - Improving the Generation of Attribution Masks for Fine-grained Classification
IF 19.5 | CAS Q2 | Computer Science
International Journal of Computer Vision, Pub Date: 2025-05-22, DOI: 10.1007/s11263-025-02453-z
Dimitri Korsch, Maha Shadaydeh, Joachim Denzler
{"title":"Simplified Concrete Dropout - Improving the Generation of Attribution Masks for Fine-grained Classification","authors":"Dimitri Korsch, Maha Shadaydeh, Joachim Denzler","doi":"10.1007/s11263-025-02453-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02453-z","url":null,"abstract":"<p>In fine-grained classification, which is classifying images into subcategories within a common broader category, it is crucial to have precise visual explanations of the classification model’s decision. While commonly used attention- or gradient-based methods deliver either too coarse or too noisy explanations unsuitable for highlighting subtle visual differences reliably, perturbation-based methods can precisely locate pixels causally responsible for the predicted category. The <i>fill-in of the dropout</i> (FIDO) algorithm is one of those methods, which utilizes <i>concrete dropout</i> (CD) to sample a set of attribution masks and updates the sampling parameters based on the output of the classification model. In this paper, we present a solution against the high variance in the gradient estimates, a known problem of the FIDO algorithm that has been mitigated until now by large mini-batch updates of the sampling parameters. First, our solution allows for estimating the parameters with smaller mini-batch sizes without losing the quality of the estimates but with a reduced computational effort. Next, our method produces finer and more coherent attribution masks. Finally, we use the resulting attribution masks to improve the classification performance on three fine-grained datasets without additional fine-tuning steps and achieve results that are otherwise only achieved if ground truth bounding boxes are used.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144113873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
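
The concrete dropout building block that FIDO-style methods optimize can be sketched as relaxed Bernoulli (binary-concrete) mask sampling from per-pixel logits. The temperature, mask resolution, and the eight-sample mini-batch in the usage lines are assumptions, and the paper's variance-reduction changes are not shown.

```python
import torch

def sample_concrete_mask(logits, temperature=0.1):
    """logits: (H, W) unconstrained mask parameters. Returns a relaxed binary
    mask in (0, 1) that stays differentiable w.r.t. the logits."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log(1 - u)               # logistic noise
    return torch.sigmoid((logits + noise) / temperature)

# usage: draw a mini-batch of masks, score them with a classifier, and backprop
# through the relaxed masks to update the logits
logits = torch.zeros(224, 224, requires_grad=True)
masks = torch.stack([sample_concrete_mask(logits) for _ in range(8)])
```
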