{"title":"Spatial-Temporal Transformer for Single RGB-D Camera Synchronous Tracking and Reconstruction of Non-rigid Dynamic Objects","authors":"Xiaofei Liu, Zhengkun Yi, Xinyu Wu, Wanfeng Shang","doi":"10.1007/s11263-025-02469-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02469-5","url":null,"abstract":"<p>We propose a simple and effective method that views the problem of single RGB-D camera synchronous tracking and reconstruction of non-rigid dynamic objects as an aligned sequential point cloud prediction problem. Our method does not require additional data transformations (truncated signed distance function or deformation graphs, etc.), alignment constraints (handcrafted features or optical flow, etc.), and prior regularities (as-rigid-as-possible or embedded deformation, etc.). We propose an end-to-end model architecture that is <b>TR</b>ansformer <b>for</b> synchronous <b>T</b>racking and <b>R</b>econstruction of non-rigid dynamic target based on RGB-D images from a monocular camera, called TR4TR. We use a spatial-temporal combined 2D image encoder that directly encodes features from RGB-D sequence images, and a 3D point decoder to generate aligned sequential point cloud containing tracking and reconstruction results. The TR4TR model outperforms the baselines on the DeepDeform non-rigid dataset, and outperforms the state-of-the-art method by 8.82% on the deformation error evaluation metric. In addition, TR4TR is more robust when the target undergoes large inter-frame deformation. The code is available at https://github.com/xfliu1998/tr4tr-main.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144104762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic-Aligned Learning with Collaborative Refinement for Unsupervised VI-ReID","authors":"De Cheng, Lingfeng He, Nannan Wang, Dingwen Zhang, Xinbo Gao","doi":"10.1007/s11263-025-02461-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02461-z","url":null,"abstract":"<p>Unsupervised visible-infrared person re-identification (USL-VI-ReID) seeks to match pedestrian images of the same individual across different modalities without human annotations for model learning. Previous methods unify pseudo-labels of cross-modality images through label association algorithms and then design contrastive learning framework for global feature learning. However, these methods overlook the cross-modality variations in feature representation and pseudo-label distributions brought by fine-grained patterns. This insight results in insufficient modality-shared learning when only global features are optimized. To address this issue, we propose a Semantic-Aligned Learning with Collaborative Refinement (SALCR) framework, which builds up optimization objective for specific fine-grained patterns emphasized by each modality, thereby achieving complementary alignment between the label distributions of different modalities. Specifically, we first introduce a Dual Association with Global Learning (DAGI) module to unify the pseudo-labels of cross-modality instances in a bi-directional manner. Afterward, a Fine-Grained Semantic-Aligned Learning (FGSAL) module is carried out to explore part-level semantic-aligned patterns emphasized by each modality from cross-modality instances. Optimization objective is then formulated based on the semantic-aligned features and their corresponding label space. To alleviate the side-effects arising from noisy pseudo-labels, we propose a Global-Part Collaborative Refinement (GPCR) module to mine reliable positive sample sets for the global and part features dynamically and optimize the inter-instance relationships. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/FranklinLingfeng/code-for-SALCR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144097302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Deblur Polarized Images","authors":"Chu Zhou, Minggui Teng, Xinyu Zhou, Chao Xu, Imari Sato, Boxin Shi","doi":"10.1007/s11263-025-02459-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02459-7","url":null,"abstract":"<p>A polarization camera can capture four linear polarized images with different polarizer angles in a single shot, which is useful in polarization-based vision applications since the degree of linear polarization (DoLP) and the angle of linear polarization (AoLP) can be directly computed from the captured polarized images. However, since the on-chip micro-polarizers block part of the light so that the sensor often requires a longer exposure time, the captured polarized images are prone to motion blur caused by camera shakes, leading to noticeable degradation in the computed DoLP and AoLP. Deblurring methods for conventional images often show degraded performance when handling the polarized images since they only focus on deblurring without considering the polarization constraints. In this paper, we propose a polarized image deblurring pipeline to solve the problem in a polarization-aware manner by adopting a divide-and-conquer strategy to explicitly decompose the problem into two less ill-posed sub-problems, and design a two-stage neural network to handle the two sub-problems respectively. Experimental results show that our method achieves state-of-the-art performance on both synthetic and real-world images, and can improve the performance of polarization-based vision applications such as image dehazing and reflection removal.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"76 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144088326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Closed-Form Formulae for Feature-Based Subpixel Alignment in Patch-Based Matching","authors":"Laurent Valentin Jospin, Hamid Laga, Farid Boussaid, Mohammed Bennamoun","doi":"10.1007/s11263-025-02457-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02457-9","url":null,"abstract":"<p>Patch-based matching is a technique meant to measure the disparity between pixels in a source and target image and is at the core of various methods in computer vision. When the subpixel disparity between the source and target images is required, the cost function or the target image has to be interpolated. While cost-based interpolation is easier to implement, multiple works have shown that image-based interpolation can increase the accuracy of the disparity estimate. In this paper we review closed-form formulae for subpixel disparity computation for one dimensional matching, e.g., rectified stereo matching, for the standard cost functions used in patch-based matching. We then propose new formulae to generalize to high-dimensional search spaces, which is necessary for unrectified stereo matching and optical flow. We also compare the image-based interpolation formulae with traditional cost-based formulae, and show that image-based interpolation brings a significant improvement over the cost-based interpolation methods for two dimensional search spaces, and small improvement in the case of one dimensional search spaces. The zero-mean normalized cross correlation cost function is found to be preferable for subpixel alignment. A new error model, based on very broad assumptions is outlined in the Supplementary Material to demonstrate why these image-based interpolation formulae outperform their cost-based counterparts and why the zero-mean normalized cross correlation function is preferable for subpixel alignement.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"121 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144088324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SimZSL: Zero-Shot Learning Beyond a Pre-defined Semantic Embedding Space","authors":"Mina Ghadimi Atigh, Stephanie Nargang, Martin Keller-Ressel, Pascal Mettes","doi":"10.1007/s11263-025-02422-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02422-6","url":null,"abstract":"<p>Zero-shot recognition is centered around learning representations to transfer knowledge from seen to unseen classes. Where foundational approaches perform the transfer with semantic embedding spaces, <i>e.g.,</i> from attributes or word vectors, the current state-of-the-art relies on prompting pre-trained vision-language models to obtain class embeddings. Whether zero-shot learning is performed with attributes, CLIP, or something else, current approaches <i>de facto</i> assume that there is a pre-defined embedding space in which seen and unseen classes can be positioned. Our work is concerned with real-world zero-shot settings where a pre-defined embedding space can no longer be assumed. This is natural in domains such as biology and medicine, where class names are not common English words, rendering vision-language models useless; or neuroscience, where class relations are only given with non-semantic human comparison scores. We find that there is one data structure enabling zero-shot learning in both standard and non-standard settings: a similarity matrix spanning the seen and unseen classes. We introduce four <i>similarity-based zero-shot learning</i> challenges, tackling open-ended scenarios such as learning with uncommon class names, learning from multiple partial sources, and learning with missing knowledge. As the first step for zero-shot learning beyond a pre-defined semantic embedding space, we propose <span>(kappa )</span>-MDS, a general approach that obtains a prototype for each class on any manifold from similarities alone, even when part of the similarities are missing. Our approach can be plugged into any standard, hyperspherical, or hyperbolic zero-shot learner. Experiments on existing datasets and the new benchmarks show the promise and challenges of similarity-based zero-shot learning.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"127 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144083184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HumanLiff: Layer-wise 3D Human Diffusion Model","authors":"Shoukang Hu, Fangzhou Hong, Tao Hu, Liang Pan, Haiyi Mei, Weiye Xiao, Lei Yang, Ziwei Liu","doi":"10.1007/s11263-025-02477-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02477-5","url":null,"abstract":"<p>3D human generation from 2D images has achieved remarkable progress through the synergistic utilization of neural rendering and generative models. Existing 3D human generative models mainly generate a clothed 3D human as an inseparable 3D model in a single pass, while rarely considering the layer-wise nature of a clothed human body, which often consists of the human body and various clothes such as underwear, outerwear, trousers, shoes, etc. In this work, we propose <b>HumanLiff</b>, the first layer-wise 3D human generative model with a unified diffusion process. Specifically, HumanLiff firstly generates minimal-clothed humans, represented by tri-plane features, in a canonical space, and then progressively generates clothes in a layer-wise manner. In this way, the 3D human generation is thus formulated as a sequence of diffusion-based 3D conditional generation. To reconstruct more fine-grained 3D humans with tri-plane representation, we propose a tri-plane shift operation that splits each tri-plane into three sub-planes and shifts these sub-planes to enable feature grid subdivision. To further enhance the controllability of 3D generation with 3D layered conditions, HumanLiff hierarchically fuses tri-plane features and 3D layered conditions to facilitate the 3D diffusion model learning. Extensive experiments on two layer-wise 3D human datasets, SynBody (synthetic) and TightCap (real-world), validate that HumanLiff significantly outperforms state-of-the-art methods in layer-wise 3D human generation. Our code and datasets are available at https://skhu101.github.io/HumanLiff.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"15 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144066060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Fidelity Image Inpainting with Multimodal Guided GAN Inversion","authors":"Libo Zhang, Yongsheng Yu, Jiali Yao, Heng Fan","doi":"10.1007/s11263-025-02448-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02448-w","url":null,"abstract":"<p>Generative Adversarial Network (GAN) inversion have demonstrated excellent performance in image inpainting that aims to restore lost or damaged image texture using its unmasked content. Previous GAN inversion-based methods usually utilize well-trained GAN models as effective priors to generate the realistic regions for missing holes. Despite excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed <i>MMInvertFill</i>, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre-modulation and a GAN generator with <span>( mathcal {F} & mathcal {W}^+)</span> latent space. Specifically, the multimodal encoder aims to enhance the multi-scale structures with additional semantic segmentation edge texture modalities through a gated mask-aware attention module. Afterwards, a pre-modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the <span>( mathcal {F} & mathcal {W}^+)</span> latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft-update Mean Latent module to capture more diversified in-domain patterns for generating high-fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, including CelebA-HQ, Places2, OST, CityScapes, MetFaces and Scenery, we show that our MMInvertFill qualitatively and quantitatively outperforms other state-of-the-arts and it supports the completion of out-of-domain images effectively. Our project webpage including code and results will be available at https://yeates.github.io/mm-invertfill.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144067126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defending Against Adversarial Examples Via Modeling Adversarial Noise","authors":"Dawei Zhou, Nannan Wang, Bo Han, Tongliang Liu, Xinbo Gao","doi":"10.1007/s11263-025-02467-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02467-7","url":null,"abstract":"<p>Adversarial examples have become a major threat to the reliable application of deep learning models. Meanwhile, this issue promotes the development of adversarial defenses. Adversarial noise contains well-generalizing and misleading features, which can manipulate predicted labels to be flipped maliciously. Motivated by this, we study <i>modeling adversarial noise</i> for defending against adversarial examples by learning the transition relationship between adversarial labels (<i>i.e.</i>, flipped labels caused by adversarial noise) and natural labels (<i>i.e.</i>, real labels of natural samples). In this work, we propose an adversarial defense method from the perspective of modeling adversarial noise. Specifically, we construct an instance-dependent label transition matrix to represent the label transition relationship for explicitly modeling adversarial noise. The label transition matrix is obtained from the input sample by leveraging a label transition network. By exploiting the label transition matrix, we can infer the natural label from the adversarial label and thus correct wrong predictions misled by adversarial noise. Additionally, to enhance the robustness of the label transition network, we design an adversarial robustness constraint at the transition matrix level. Experimental results demonstrate that our method effectively improves the robust accuracy against multiple attacks and exhibits great performance in detecting adversarial input samples.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143979610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IPAD: Iterative, Parallel, and Diffusion-Based Network for Scene Text Recognition","authors":"Xiaomeng Yang, Zhi Qiao, Yu Zhou","doi":"10.1007/s11263-025-02443-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02443-1","url":null,"abstract":"<p>Nowadays, scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains the inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it burdens the computational load. Besides, separating linguistic knowledge from vision information may harm the final prediction. In this paper, we propose an alternative solution that uses a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize the discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on the benchmark datasets, including both Chinese and English text images.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"17 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143979615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bamboo: Building Mega-Scale Vision Dataset Continually with Human–Machine Synergy","authors":"Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He, Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu","doi":"10.1007/s11263-025-02450-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02450-2","url":null,"abstract":"<p>Large-scale datasets play a vital role in computer vision. But current datasets are annotated blindly without differentiation to samples, making the data collection inefficient and unscalable. The open question is how to build a mega-scale dataset actively. Although advanced active learning algorithms might be the answer, we experimentally found that they are lame in the realistic annotation scenario where out-of-distribution data is extensive. This work thus proposes a novel active learning framework for realistic dataset annotation. Equipped with this framework, we build a high-quality vision dataset—<b>Bamboo</b>, which consists of 69M image classification annotations with 119K categories and 28M object bounding box annotations with 809 categories. We organize these categories by a hierarchical taxonomy integrated from several knowledge bases. The classification annotations are four times larger than ImageNet22K, and that of detection is three times larger than Object365. Compared to ImageNet22K and Objects365, models pre-trained on Bamboo achieve superior performance among various downstream tasks (6.2% gains on classification and 2.1% gains on detection). We believe our active learning framework and Bamboo are essential for future work. Code and dataset are available at https://github.com/ZhangYuanhan-AI/Bamboo.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"123 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143940383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}