International Journal of Computer Vision: Latest Articles

Preconditioned Score-Based Generative Models
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-21 · DOI: 10.1007/s11263-025-02410-w
Hengyuan Ma, Xiatian Zhu, Jianfeng Feng, Li Zhang
Abstract: Score-based generative models (SGMs) have recently emerged as a promising class of generative models. A fundamental limitation, however, is that their sampling process is slow, requiring many (e.g., 2000) sequential iterations. An intuitive acceleration strategy is to reduce the number of sampling iterations, but this causes severe performance degradation. We attribute this problem to the ill-conditioning of the Langevin dynamics and reverse diffusion in the sampling process. Based on this insight, we propose a novel preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate the problem. PDS alters the sampling process of a vanilla SGM at marginal extra computational cost and without model retraining. Theoretically, we prove that PDS preserves the output distribution of the SGM, with no risk of introducing systematic bias into the original sampling process. We further reveal a theoretical relation between the PDS parameter and the number of sampling iterations, easing parameter estimation under varying iteration budgets. Extensive experiments on image datasets of various resolutions and diversity validate that PDS consistently accelerates off-the-shelf SGMs while maintaining synthesis quality. In particular, PDS achieves up to a 28× speed-up on the more challenging high-resolution (1024×1024) image generation task. Compared with recent generative models (e.g., CLD-SGM, DDIM, and Analytic-DDIM), PDS achieves the best sampling quality on CIFAR-10, with an FID score of 1.99. Our code is publicly available at https://github.com/fudan-zvg/PDS.
Citations: 0
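The core idea of matrix preconditioning in Langevin sampling can be pictured in a few lines. The toy below is an independent illustration, not the authors' PDS implementation: it uses a known Gaussian target and assumes the preconditioner A = Σ, showing that any symmetric positive-definite A leaves the target (approximately) invariant for small step sizes while equalizing the step scales of an ill-conditioned distribution.

```python
import numpy as np

# Toy preconditioned Langevin dynamics on a badly conditioned 2-D Gaussian.
# Update: x <- x + (eps/2) * A @ s(x) + sqrt(eps) * sqrt(A) @ z,  z ~ N(0, I),
# where s(x) is the score of the target. Any SPD matrix A (approximately)
# preserves the target distribution; a well-chosen A removes ill-conditioning.

rng = np.random.default_rng(0)
cov = np.diag([1.0, 100.0])            # ill-conditioned target N(0, cov)
inv_cov = np.linalg.inv(cov)
score = lambda x: -x @ inv_cov         # score of the Gaussian target

def langevin(A, steps=500, eps=0.1, chains=4000):
    sqrtA = np.linalg.cholesky(A)
    x = rng.standard_normal((chains, 2))
    for _ in range(steps):
        z = rng.standard_normal((chains, 2))
        x = x + 0.5 * eps * score(x) @ A.T + np.sqrt(eps) * z @ sqrtA.T
    return x

# Preconditioning with A = cov equalizes the scales of the two coordinates,
# so one shared step size mixes well in both directions.
samples = langevin(cov)
print(np.var(samples, axis=0))         # close to the target variances [1, 100]
```

With A = I the slow coordinate (variance 100) would need roughly 100× more iterations to mix at the same step size, which is the inefficiency preconditioning targets.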
CT3D++: Improving 3D Object Detection with Keypoint-Induced Channel-wise Transformer
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-20 · DOI: 10.1007/s11263-025-02404-8
Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye
Abstract: 3D object detection from point clouds is a rapidly advancing area of computer vision, aiming to detect and localize objects in three-dimensional space accurately and efficiently. Current 3D detectors commonly fall short in flexibility and scalability, leaving ample room for performance improvements. In this paper, we address these limitations by introducing two frameworks for 3D object detection. First, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder on the point features within each proposal. Second, we present an enhanced network, CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information. CT3D++ also employs a point-to-key bidirectional encoder for more efficient feature encoding at reduced computational cost. By replacing the corresponding components of CT3D with these novel modules, CT3D++ achieves state-of-the-art performance on both the KITTI dataset and the large-scale Waymo Open Dataset. The source code will be made available at https://github.com/hlsheng1/CT3Dplusplus.
Citations: 0
PointSea: Point Cloud Completion via Self-structure Augmentation
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-19 · DOI: 10.1007/s11263-025-02400-y
Zhe Zhu, Honghua Chen, Xing He, Mingqiang Wei
Abstract: Point cloud completion is a fundamental yet not well-solved problem in 3D vision. Current approaches often rely on 3D coordinate information and/or additional data (e.g., images and scanning viewpoints) to fill in missing parts. Unlike these methods, we explore self-structure augmentation and propose PointSea for global-to-local point cloud completion. In the global stage, consider how we inspect a defective region of a physical object: we observe it from various perspectives for a better understanding. Inspired by this, PointSea augments the data representation with self-projected depth images from multiple views. To reconstruct a compact global shape from this cross-modal input, we incorporate a feature fusion module that fuses features at both the intra-view and inter-view levels. In the local stage, to reveal highly detailed structures, we introduce a point generator called the self-structure dual-generator, which integrates learned shape priors and geometric self-similarities for shape refinement. Unlike existing efforts that apply a unified strategy to all points, our dual-path design adapts the refinement strategy to the structural type of each point, addressing its specific incompleteness. Comprehensive experiments on widely used benchmarks demonstrate that PointSea effectively understands global shapes and generates local details from incomplete input, showing clear improvements over existing methods. Our code is available at https://github.com/czvvd/SVDFormer_PointSea.
Citations: 0
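The "self-projected depth images" idea is easy to picture: render the incomplete cloud into depth maps from several viewpoints and treat them as additional input views. A minimal orthographic sketch follows; the resolution, the three axis-aligned views, and the projection model are assumptions made here for brevity, not PointSea's actual renderer.

```python
import numpy as np

# Project a toy point cloud into depth images from three orthogonal views.
# Each pixel keeps the depth of the nearest point that lands in it.

rng = np.random.default_rng(0)
pts = rng.random((2048, 3))              # toy point cloud in the unit cube

def depth_map(points, axis, res=32):
    """Orthographic depth along `axis`; the other two coords index pixels."""
    uv_axes = [a for a in range(3) if a != axis]
    uv = np.clip((points[:, uv_axes] * res).astype(int), 0, res - 1)
    depth = np.full((res, res), np.inf)  # inf marks pixels no point hits
    for (u, v), z in zip(uv, points[:, axis]):
        depth[u, v] = min(depth[u, v], z)   # keep nearest point per pixel
    return depth

views = [depth_map(pts, axis) for axis in range(3)]  # three orthogonal views
print([v.shape for v in views])
```

In the full method these depth images are fed through image encoders and fused with the point features, giving the network multiple "perspectives" on the defective region.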
Learning to Generalize Heterogeneous Representation for Cross-Modality Image Synthesis via Multiple Domain Interventions
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-19 · DOI: 10.1007/s11263-025-02381-y
Yawen Huang, Huimin Huang, Hao Zheng, Yuexiang Li, Feng Zheng, Xiantong Zhen, Yefeng Zheng
Abstract: Magnetic resonance imaging with diverse modalities substantially increases productivity in routine diagnosis and advanced research. However, high inter-equipment variability and expensive examinations remain key obstacles to acquiring and utilizing multi-modal images. Missing modalities can often be synthesized from existing ones, and although deep image style transfer has advanced rapidly, such synthesis is not always achievable, and can even be impractical, on medical data. The proposed method addresses this issue with a convolutional sparse coding (CSC) adaptation network that tackles the lack of generalizable medical image representation learning. We reduce inter-domain and intra-domain divergences with domain-adaptation and domain-standardization modules, respectively. On the basis of the CSC features, we penalize their subspace mismatch to reduce the generalization error. The overall framework is cast in a minimax setting, and extensive experiments show that the proposed method yields state-of-the-art results on multiple datasets.
Citations: 0
LR-ASD: Lightweight and Robust Network for Active Speaker Detection
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-19 · DOI: 10.1007/s11263-025-02399-2
Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, Yanru Chen
Abstract: Active speaker detection is a challenging task aimed at identifying who is speaking. Given its critical importance in numerous applications, the task has received considerable attention. Existing studies endeavor to enhance performance at any cost, feeding in information from multiple candidates and designing complex models. While these methods achieve excellent performance, their substantial memory and computational demands hinder application in resource-limited scenarios. Therefore, in this study we construct a lightweight and robust network for active speaker detection, named LR-ASD, by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, using a simple channel attention module for multi-modal feature fusion, and applying a gated recurrent unit (GRU) with low computational complexity for temporal modeling. Results on the AVA-ActiveSpeaker dataset show that LR-ASD achieves competitive mean Average Precision (mAP) (94.5% vs. 95.2%) while requiring far fewer resources than the state-of-the-art method, particularly in model parameters (0.84 M vs. 34.33 M, roughly 41× fewer) and floating point operations (FLOPs) (0.51 G vs. 4.86 G, roughly 10× fewer). LR-ASD also demonstrates excellent robustness, achieving state-of-the-art performance on the Talkies, Columbia, and RealVAD datasets in cross-dataset testing without fine-tuning. The project is available at https://github.com/Junhua-Liao/LR-ASD.
Citations: 0
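A "simple channel attention module" of the kind the abstract mentions is typically a squeeze-and-excitation-style gate. The sketch below is a generic illustration of that pattern; the layer sizes, reduction ratio, and exact wiring are assumptions for the example, not LR-ASD's actual module.

```python
import numpy as np

# Squeeze-and-excitation-style channel attention: pool each channel to a
# scalar, pass the channel summary through a small bottleneck MLP, and use
# the resulting sigmoid gates to re-weight the channels of the fused feature.

rng = np.random.default_rng(0)

def channel_attention(feat, W1, W2):
    """feat: (C, T) fused audio-visual features; returns re-weighted features."""
    s = feat.mean(axis=1)                    # squeeze: average over time -> (C,)
    h = np.maximum(W1 @ s, 0.0)              # excitation: bottleneck with ReLU
    gate = 1.0 / (1.0 + np.exp(-(W2 @ h)))   # per-channel sigmoid gate in (0, 1)
    return feat * gate[:, None]              # re-scale each channel

C, T, r = 8, 16, 2                           # channels, time steps, reduction
feat = rng.standard_normal((C, T))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(feat, W1, W2)
print(out.shape)                             # (8, 16)
```

The appeal for a lightweight model is that the gate costs only two tiny matrix-vector products per frame, far cheaper than cross-attention between modalities.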
Fully Decoupled End-to-End Person Search: An Approach without Conflicting Objectives
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-19 · DOI: 10.1007/s11263-025-02407-5
Pengcheng Zhang, Xiaohan Yu, Xiao Bai, Jin Zheng, Xin Ning, Edwin R. Hancock
Abstract: End-to-end person search aims to jointly detect and re-identify a target person in raw scene images with a unified model. The detection sub-task learns to identify all persons as one category, while the re-identification (re-id) sub-task aims to discriminate persons of different identities, resulting in conflicting optimization objectives. Existing works have proposed decoupling end-to-end person search to alleviate this conflict, yet these methods remain sub-optimal on the sub-tasks because their models are only partially decoupled, limiting overall person search performance. To eliminate the last coupled component of decoupled models without sacrificing the efficiency of end-to-end person search, we propose a fully decoupled person search framework. Specifically, we design a task-incremental network that constructs an end-to-end model through a task-incremental learning procedure. Since the detection sub-task is easier, we start by training a lightweight detection sub-network and expand it with a re-id sub-network trained in a second stage. On top of the fully decoupled design, we also enable one-stage training of the task-incremental network. The fully decoupled framework further allows an Online Representation Distillation that mitigates the representation gap between end-to-end and two-step models for learning robust representations. Without requiring an offline teacher re-id model, this transfers structured representational knowledge learned from cropped images to the person search model, so the learned person representations focus more on discriminative cues of foreground persons and suppress distracting background information. To assess the effectiveness and efficiency of the proposed method, we conduct comprehensive experiments on two popular person search datasets, PRW and CUHK-SYSU. The results demonstrate that the fully decoupled model outperforms previous decoupled methods, and its inference is efficient among recent end-to-end methods. The source code is available at https://github.com/PatrickZad/fdps.
Citations: 0
Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-18 · DOI: 10.1007/s11263-025-02396-5
Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger
Abstract: Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between the V and I modalities, especially under real-world conditions where images suffer corruptions such as blur, noise, and weather effects. Despite their practical relevance, deep learning models for multimodal V-I ReID remain far less investigated than single-modal and cross-modal V-to-I settings. Moreover, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain high accuracy. In this paper, we propose an efficient model for multimodal V-I ReID, named Multimodal Middle Stream Fusion (MMSF), which preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to handle corrupted multimodal data in V-I ReID, allowing dynamic balancing of each modality's importance. The literature typically reports ReID performance on clean datasets; more recent evaluation protocols assess the robustness of ReID models under challenging real-world scenarios with realistic corruptions, but these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where the V and I cameras are co-located (CL) and not co-located (NCL). Finally, we explore the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy for improving the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate which multimodal V-I ReID models are most likely to perform well in real-world operational conditions. In particular, the proposed ML-MDA proves essential for a V-I person ReID system to sustain high accuracy and robustness in the face of corrupted multimodal images. Our multimodal ReID models attain the best accuracy-complexity trade-off under both CL and NCL settings compared with state-of-the-art unimodal ReID systems, except on the ThermalWORLD dataset due to its low-quality infrared images, and our MMSF model outperforms every method under both CL and NCL camera scenarios. GitHub code: https://github.com/art2611/MREiD-UCD-CCD.git.
Citations: 0
A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-18 · DOI: 10.1007/s11263-025-02405-7
Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang
Abstract: For pedestrian attribute recognition, we demonstrate that deep models can memorize the pattern of attribute co-occurrences inherent to a dataset, whether explicitly or implicitly. However, since attribute interdependencies are highly variable and unpredictable across scenarios, the modeled co-occurrences in effect act as a data selection bias that hardly generalizes to out-of-distribution samples. To address this thorny issue, we formulate a novel concept of attribute-disentangled feature learning, which minimizes the mutual information among features of different attributes, ensuring that the recognition of one attribute is independent of the presence of others. Building on it, we develop practical approaches that effectively decouple attributes by suppressing the feature factors shared among attribute-specific features. As compelling merits, our method incurs minimal test-time computation and is highly extensible: with slight modifications, it yields further improvements in exploring the feature space, softening the imbalanced attribute distribution of a dataset, and flexibly preserving certain causal attribute interdependencies. Comprehensive experiments on various realistic datasets, such as PA100k, PETAzs, and RAPzs, validate the efficacy and a spectrum of advantages of our method.
Citations: 0
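To make the disentanglement goal concrete, the toy below penalizes the cross-correlation between two attribute-specific feature sets, a simple stand-in for the mutual-information minimization the abstract describes (the paper's actual mechanism suppresses shared feature factors; the synthetic data and penalty here are assumptions for illustration).

```python
import numpy as np

# Two attribute-specific features that leak a shared factor are strongly
# cross-correlated; removing the shared component drives the penalty to ~0.

rng = np.random.default_rng(0)
n, d = 256, 4
shared = rng.standard_normal((n, 1))             # factor common to both attributes
f_a = shared + 0.1 * rng.standard_normal((n, d)) # feature for attribute A
f_b = shared + 0.1 * rng.standard_normal((n, d)) # feature for attribute B

def cross_corr(a, b):
    """Mean absolute correlation between columns of a and columns of b."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return np.abs((a.T @ b) / len(a)).mean()

entangled = cross_corr(f_a, f_b)                 # high: shared factor dominates
decoupled = cross_corr(f_a - shared, f_b - shared)  # low: only independent noise
print(entangled, decoupled)
```

A dataset's co-occurrence bias lives exactly in such shared factors: once they are suppressed, predicting one attribute no longer implicitly conditions on the others.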
Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-17 · DOI: 10.1007/s11263-025-02409-3
Zeyu Wang, Libo Zhao, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang
Abstract: Multi-modality image fusion aims to extract complementary features from multiple source images of different modalities, generating a fused image that inherits their advantages. To address challenges in cross-modality shared feature (CMSF) extraction, single-modality specific feature (SMSF) fusion, and the absence of ground truth (GT) images, we propose MTG-Fusion, a multi-text-guided model. We leverage large vision-language models to generate text descriptions tailored to the input images, providing novel insights into these challenges. Our model introduces a text-guided CMSF extractor (TGCE) and a text-guided SMSF fusion module (TGSF). TGCE transforms visual features into the text domain using manifold-isometric domain transform techniques and enables effective visual-text interaction based on text-vision and text-text distances. TGSF fuses each dimension of the visual features with the corresponding text features, creating a weight matrix used for SMSF fusion. We also incorporate the constructed textual GT into the loss function for collaborative training. Extensive experiments demonstrate that MTG-Fusion achieves state-of-the-art performance on infrared-visible image fusion and medical image fusion tasks. The code is available at: https://github.com/zhaolb4080/MTG-Fusion.
Citations: 0
Not All Pixels are Equal: Learning Pixel Hardness for Semantic Segmentation
IF 19.5 | CAS Tier 2 | Computer Science
International Journal of Computer Vision · Pub Date: 2025-03-17 · DOI: 10.1007/s11263-025-02416-4
Xin Xiao, Daiguo Zhou, Jiagao Hu, Yi Hu, Yongchao Xu
Abstract: Semantic segmentation has witnessed great progress. Despite impressive overall results, segmentation performance in some hard areas (e.g., small objects or thin parts) is still unsatisfactory. A straightforward solution is hard sample mining, yet most existing hard-pixel mining strategies for semantic segmentation rely on a pixel's loss value, which tends to decrease during training. Intuitively, a pixel's hardness for segmentation mainly depends on image structure and should therefore be stable. In this paper, we propose to learn pixel hardness for semantic segmentation by leveraging the hardness information contained in global and historical loss values. More precisely, we add a gradient-independent branch that learns a hardness level (HL) map by maximizing the hardness-weighted segmentation loss, which the segmentation head minimizes. This encourages large hardness values in difficult areas, yielding an appropriate and stable HL map. Despite its simplicity, the proposed method can be applied to most segmentation methods, with no extra inference cost and only marginal extra training cost. Without bells and whistles, it achieves consistent improvement (1.37% mIoU on average) over the most popular semantic segmentation methods on the Cityscapes dataset and demonstrates good cross-domain generalization. The source codes are available at this link.
Citations: 0
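The adversarial weighting scheme the abstract describes can be shown in miniature: one branch does gradient ascent on a hardness-weighted loss, so the learned weights concentrate on the pixels where the loss is largest. This is an illustrative sketch only; the actual method uses a CNN branch and its own normalization, and the softmax weighting below is an assumption made for the example.

```python
import numpy as np

# Learn a toy hardness-level (HL) map by MAXIMIZING a hardness-weighted loss.
# The HL branch ascends the weighted loss (growing where pixels are hard),
# while the segmentation head would minimize the same quantity.

rng = np.random.default_rng(0)
L = rng.random(16)                       # stand-in per-pixel segmentation losses
h = np.zeros(16)                         # hardness logits (the HL map)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lr = 1.0
for _ in range(100):
    p = softmax(h)                       # normalized hardness weights
    weighted = (p * L).sum()             # hardness-weighted loss
    h += lr * p * (L - weighted)         # gradient ASCENT on the weighted loss

# The learned hardness peaks exactly where the per-pixel loss is largest.
print(int(np.argmax(h)) == int(np.argmax(L)))   # True
```

Because the ascent direction depends on the loss landscape rather than a single training snapshot, the resulting hardness ordering stays stable even as individual loss values shrink during training, which is the motivation given in the abstract.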