Title: Investigating Self-Supervised Methods for Label-Efficient Learning
Authors: Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammed Awais
DOI: https://doi.org/10.1007/s11263-025-02397-4
Journal: International Journal of Computer Vision
Published: 2025-03-10

Abstract: Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across several downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse-avoidance techniques, such as centring, ME-MAX, and Sinkhorn normalisation, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification, and semantic segmentation. Furthermore, when testing the model on large-scale datasets, we show performance gains in various tasks.
{"title":"Semantics-Conditioned Generative Zero-Shot Learning via Feature Refinement","authors":"Shiming Chen, Ziming Hong, Xinge You, Ling Shao","doi":"10.1007/s11263-025-02394-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02394-7","url":null,"abstract":"<p>Generative zero-shot learning (ZSL) recognizes novel categories by employing a cross-modal generative model conditioned on semantic factors (such as attributes) to transfer knowledge from seen classes to unseen ones. Many existing generative ZSL methods rely solely on feature extraction models pre-trained on ImageNet, disregarding the cross-dataset bias between ImageNet and ZSL benchmarks. This bias inevitably leads to suboptimal visual features that lack semantic relevance to the predefined attributes, constraining the generator’s ability to synthesize semantically meaningful visual features for generative ZSL. In this paper, we introduce a visual feature refinement method (ViFR) to mitigate cross-dataset bias and advance generative ZSL. Given a generative ZSL model, ViFR incorporates both pre-feature refinement (Pre-FR) and post-feature refinement (Post-FR) modules to simultaneously enhance visual features. In Pre-FR, ViFR aims to learn attribute localization for discriminative visual feature representations using an attribute-guided attention mechanism optimized with attribute-based cross-entropy loss. In Post-FR, ViFR learns an effective visual<span>(rightarrow )</span>semantic mapping by integrating the semantic-conditioned generator into a unified generative model to enhance visual features. Additionally, we propose a self-adaptive margin center loss (SAMC-loss) that collaborates with semantic cycle-consistency loss to guide Post-FR in learning class- and semantically-relevant representations. The features in Post-FR are concatenated to form fully refined visual features for ZSL classification. Extensive experiments on benchmark datasets (i.e., CUB, SUN, and AWA2) demonstrate that ViFR outperforms state-of-the-art ZSL approaches. Our implementation is publicly available at https://github.com/shiming-chen/ViFR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"13 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143583010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Hierarchical Learning for 3D Semantic Segmentation","authors":"Chongshou Li, Yuheng Liu, Xinke Li, Yuning Zhang, Tianrui Li, Junsong Yuan","doi":"10.1007/s11263-025-02387-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02387-6","url":null,"abstract":"<p>The inherent structure of human cognition facilitates the hierarchical organization of semantic categories for three-dimensional objects, simplifying the visual world into distinct and manageable layers. A vivid example is observed in the animal-taxonomy domain, where distinctions are not only made between broader categories like birds and mammals but also within subcategories such as different bird species, illustrating the depth of human hierarchical processing. This observation bridges to the computational realm as this paper presents deep hierarchical learning (DHL) on 3D data. By formulating a probabilistic representation, our proposed DHL lays a pioneering theoretical foundation for hierarchical learning (HL) in visual tasks. Addressing the primary challenges in effectiveness and generality of DHL for 3D data, we 1) introduce a hierarchical regularization term to connect hierarchical coherence across the predictions with the classification loss; 2) develop a general deep learning framework with a hierarchical embedding fusion module for enhanced hierarchical embedding learning; and 3) devise a novel method for constructing class hierarchies in datasets with non-hierarchical labels, leveraging recent vision language models. A novel hierarchy quality indicator, CH-MOS, supported by questionnaire-based surveys, is developed to evaluate the semantic explainability of the generated class hierarchy for human understanding. Our methodology’s validity is confirmed through extensive experiments on multiple datasets for 3D object and scene point cloud semantic segmentation tasks, demonstrating DHL’s capability in parsing 3D data across various hierarchical levels. This evidence suggests DHL’s potential for broader applicability to a wide range of tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"56 81 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143561192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal Transductive Inference for Few-Shot Video Object Segmentation","authors":"Mennatullah Siam","doi":"10.1007/s11263-025-02390-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02390-x","url":null,"abstract":"<p>Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training. In this paper, we present a simple but effective temporal transductive inference (TTI) approach that leverages temporal consistency in the unlabelled video frames during few-shot inference without episodic training. Key to our approach is the use of a video-level temporal constraint that augments frame-level constraints. The objective of the video-level constraint is to learn consistent linear classifiers for novel classes across the image sequence. It acts as a spatiotemporal regularizer during the transductive inference to increase temporal coherence and reduce overfitting on the few-shot support set. Empirically, our approach outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.5%. In addition, we introduce an improved benchmark dataset that is exhaustively labelled (i.e., all object occurrences are labelled, unlike the currently available). Our empirical results and temporal consistency analysis confirm the added benefits of the proposed spatiotemporal regularizer to improve temporal coherence. Our code and benchmark dataset is publicly available at, https://github.com/MSiam/tti_fsvos/.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143560823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding","authors":"Yi Liu, Chengxin Li, Shoukun Xu, Jungong Han","doi":"10.1007/s11263-025-02393-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02393-8","url":null,"abstract":"<p>Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications like autonomous driving, where visible, depth, event, LiDAR, etc., are used. Besides, few attempts for multi-modal fusion, e.g., simple concatenation, cross-modal attention, and token selection, cannot well dig into the intrinsic shared and specific details of multiple modalities. To tackle the challenge, in this paper, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, our PWRF generates modal-shared and modal-specific semantics from the whole-level modal capsules and the routing coefficients, respectively. On top of that, modal-shared and modal-specific details can be employed to solve the issue of multi-modal scene understanding, including synthetic multi-modal segmentation and visible-depth-thermal salient object detection in this paper. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released on https://github.com/liuyi1989/PWRF.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143561193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: UMSCS: A Novel Unpaired Multimodal Image Segmentation Method Via Cross-Modality Generative and Semi-supervised Learning
Authors: Feiyang Yang, Xiongfei Li, Bo Wang, Peihong Teng, Guifeng Liu
DOI: https://doi.org/10.1007/s11263-025-02389-4
Journal: International Journal of Computer Vision
Published: 2025-03-06

Abstract: Multimodal medical image segmentation is crucial for enhancing diagnostic accuracy in various clinical settings. However, because complete data are difficult to obtain in real clinical settings, the use of unpaired and unlabeled multimodal data is severely limited: unpaired data cannot serve as simultaneous model input due to spatial misalignments and morphological differences, and unlabeled data fail to provide effective supervisory signals. To alleviate these issues, we propose a semi-supervised multimodal segmentation method based on cross-modality generation that seamlessly integrates image translation and segmentation stages. In the cross-modality generation stage, we employ adversarial learning to discern the latent anatomical correlations across modalities, and then balance semantic consistency and structural consistency in image translation through region-aware constraints and cross-modal structural-information contrastive learning with dynamic weight adjustment. In the segmentation stage, we employ a teacher-student semi-supervised learning (SSL) framework in which the student network distills multimodal knowledge from the teacher network and utilizes unlabeled source data to enhance the supervisory signal. Experimental results demonstrate that our proposed method achieves state-of-the-art performance in extensive experiments on cardiac substructure and abdominal multi-organ segmentation tasks, outperforming other competitive methods.
{"title":"METS: Motion-Encoded Time-Surface for Event-Based High-Speed Pose Tracking","authors":"Ninghui Xu, Lihui Wang, Zhiting Yao, Takayuki Okatani","doi":"10.1007/s11263-025-02379-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02379-6","url":null,"abstract":"<p>We present a novel event-based representation, named Motion-Encoded Time-Surface (METS), and how it can be used to address the challenge of pose tracking under high-speed scenarios with an event camera. The core concept is dynamically encoding the pixel-wise decay rate of the Time-Surface to account for localized spatio-temporal scene dynamics captured by events, rendering remarkable adaptability with respect to motion dynamics. The consistency between METS and the scene in highly dynamic conditions establishes a reliable foundation for robust pose estimation. Building upon this, we employ a semi-dense 3D-2D alignment pipeline to fully unlock the potential of the event camera for high-speed tracking applications. Given the intrinsic characteristics of METS, we further develop specialized lightweight operations aimed at minimizing the per-event computational cost. The proposed algorithm is successfully evaluated on public datasets and our high-speed motion datasets covering various scenes and motion complexities. It shows that our approach outperforms state-of-the-art pose tracking methods, especially in highly dynamic scenarios, and is capable of tracking accurately under incredibly fast motions that are inaccessible for other event- or frame-based counterparts. Due to its simplicity, our algorithm exhibits outstanding practicality, running at over 70 Hz on a standard CPU.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"44 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143560822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Unknown Support Prototype Set for Open Set Recognition
Authors: Guosong Jiang, Pengfei Zhu, Bing Cao, Dongyue Chen, Qinghua Hu
DOI: https://doi.org/10.1007/s11263-025-02384-9
Journal: International Journal of Computer Vision
Published: 2025-03-03

Abstract: In real-world applications, visual recognition systems inevitably encounter unknown classes that are not present in the training set. Open set recognition aims to classify samples from known classes and detect unknowns simultaneously. One promising solution is to inject unknowns into training sets, and significant progress has been made on how to build an unknowns generator. However, which unknowns exhibit strong generalization is rarely explored. This work presents a new concept called Unknown Support Prototypes, which serve as good representatives for potential unknown classes. Two novel metrics, coined Support and Diversity, are introduced to construct the Unknown Support Prototype Set. In the algorithm, we further propose to construct Unknown Support Prototypes in the semantic subspace of the feature space, which largely reduces the cardinality of the Unknown Support Prototype Set and enhances the reliability of unknowns generation. Extensive experiments on several benchmark datasets demonstrate that the proposed algorithm generates unknowns with effective generalization.
{"title":"LaMD: Latent Motion Diffusion for Image-Conditional Video Generation","authors":"Yaosi Hu, Zhenzhong Chen, Chong Luo","doi":"10.1007/s11263-025-02386-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02386-7","url":null,"abstract":"<p>The video generation field has witnessed rapid improvements with the introduction of recent diffusion models. While these models have successfully enhanced appearance quality, they still face challenges in generating coherent and natural movements while efficiently sampling videos. In this paper, we propose to condense video generation into a problem of motion generation, to improve the expressiveness of motion and make video generation more manageable. This can be achieved by breaking down the video generation process into latent motion generation and video reconstruction. Specifically, we present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the motion-decomposed video autoencoder can compress patterns in movement into a concise latent motion representation. Consequently, the diffusion-based motion generator is able to efficiently generate realistic motion on a continuous latent space under multi-modal conditions, at a cost that is similar to that of image diffusion models. Results show that LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN, that encompass a variety of stochastic dynamics and highly controllable movements on multiple image-conditional video generation tasks, while significantly decreases sampling time.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143532572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: LMD: Light-Weight Prediction Quality Estimation for Object Detection in Lidar Point Clouds
Authors: Tobias Riedlinger, Marius Schubert, Sarina Penquitt, Jan-Marcel Kezmann, Pascal Colling, Karsten Kahl, Lutz Roese-Koerner, Michael Arnold, Urs Zimmermann, Matthias Rottmann
DOI: https://doi.org/10.1007/s11263-025-02377-8
Journal: International Journal of Computer Vision
Published: 2025-02-28

Abstract: Object detection on Lidar point cloud data is a promising technology for autonomous driving and robotics, and has seen a significant rise in performance and accuracy in recent years. Uncertainty estimation in particular is a crucial component for downstream tasks, as deep neural networks remain error-prone even for predictions with high confidence. Previously proposed methods for quantifying prediction uncertainty tend to alter the training scheme of the detector or rely on prediction sampling, which results in vastly increased inference time. To address these two issues, we propose LidarMetaDetect (LMD), a light-weight post-processing scheme for prediction quality estimation. Our method can easily be added to any pre-trained Lidar object detector without altering anything about the base model; it is purely based on post-processing and therefore incurs only a negligible computational overhead. Our experiments show a significant increase in statistical reliability in separating true from false predictions. We show that this improvement carries over to object detection performance when the objectness score native to the object detector is replaced. We further propose and evaluate an additional application of our method to the detection of annotation errors. Explicit samples and a conservative count of annotation error proposals indicate the viability of our method for large-scale datasets like KITTI and nuScenes. On the widely used nuScenes test dataset, 43 out of the top 100 proposals of our method indicate, in fact, erroneous annotations.