Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He
{"title":"P2P: Part-to-Part Motion Cues Guide a Strong Tracking Framework for LiDAR Point Clouds","authors":"Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He","doi":"10.1007/s11263-025-02430-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02430-6","url":null,"abstract":"<p>3D single object tracking (SOT) methods based on appearance matching has long suffered from insufficient appearance information incurred by incomplete, textureless and semantically deficient LiDAR point clouds. While motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and segmentation module. In this paper, we first provide in-depth explorations on motion paradigm, which proves that (<b>i</b>) it is feasible to directly infer target relative motion from point clouds across consecutive frames; (<b>ii</b>) fine-grained information comparison between consecutive point clouds facilitates target motion modeling. We thereby propose to perform part-to-part motion modeling for consecutive point clouds and introduce a novel tracking framework, termed <b>P2P</b>. The novel framework fuses each corresponding part information between consecutive point clouds, effectively exploring detailed information changes and thus modeling accurate target-related motion cues. Following this framework, we present P2P-point and P2P-voxel models, incorporating implicit and explicit part-to-part motion modeling by point- and voxel-based representation, respectively. Without bells and whistles, P2P-voxel sets a new state-of-the-art performance (<span>(sim )</span><b>89%</b>, <b>72%</b> and <b>63%</b> precision on KITTI, NuScenes and Waymo Open Dataset, respectively). Moreover, under the same point-based representation, P2P-point outperforms the previous motion tracker M<span>(^2)</span>Track by <b>3.3%</b> and <b>6.7%</b> on the KITTI and NuScenes, while running at a considerably high speed of <b>107 Fps</b> on a single RTX3090 GPU. The source code and pre-trained models are available at https://github.com/haooozi/P2P.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143853420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuhui Liu, Hong Li, Zhi Qiao, Yawen Huang, Xi Liu, Juan Zhang, Zhen Qian, Xiantong Zhen, Baochang Zhang
{"title":"D3T: Dual-Domain Diffusion Transformer in Triplanar Latent Space for 3D Incomplete-View CT Reconstruction","authors":"Xuhui Liu, Hong Li, Zhi Qiao, Yawen Huang, Xi Liu, Juan Zhang, Zhen Qian, Xiantong Zhen, Baochang Zhang","doi":"10.1007/s11263-025-02426-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02426-2","url":null,"abstract":"<p>Computed tomography (CT) is a cornerstone of clinical imaging, yet its accessibility in certain scenarios is constrained by radiation exposure concerns and operational limitations within surgical environments. CT reconstruction from incomplete views has attracted increasing research attention due to its great potential in medical applications. However, it is inherently an ill-posed problem, which, coupled with the complex, high-dimensional characteristics of 3D medical data, poses great challenges such as artifact mitigation, global incoherence, and high computational costs. To tackle those challenges, this paper introduces D3T, a new 3D conditional diffusion transformer that models 3D CT distributions in the low-dimensional 2D latent space for incomplete-view CT reconstruction. Our approach comprises two primary components: a triplanar vector quantized auto-encoder (TriVQAE) and a latent dual-domain diffusion transformer (LD3T). TriVQAE encodes high-resolution 3D CT images into compact 2D latent triplane codes which effectively factorize the intricate CT structures, further enabling compute-friendly diffusion model architecture design. Operating in the latent triplane space, LD3T significantly reduces the complexity of capturing the intricate structures in CT images. Its improved diffusion transformer architecture efficiently understands the global correlations across the three planes, ensuring high-fidelity 3D reconstructions. LD3T presents a new dual-domain conditional generation pipeline that incorporates both image and projection conditions, facilitating controllable reconstruction to produce 3D structures consistent with the given conditions. Moreover, LD3T introduces a new Dual-Space Consistency Loss that integrates image-level supervision beyond standard supervision in the latent space to enhance consistency in the 3D image space. Extensive experiments on four datasets with three inverse settings demonstrate the effectiveness of our proposal.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"74 4 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143836981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, Jiayi Ma
{"title":"C2RF: Bridging Multi-modal Image Registration and Fusion via Commonality Mining and Contrastive Learning","authors":"Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, Jiayi Ma","doi":"10.1007/s11263-025-02427-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02427-1","url":null,"abstract":"<p>Existing image fusion methods are typically only applicable to strictly aligned source images, and they introduce undesirable artifacts when source images are misaligned, compromising visual perception and downstream applications. In this work, we propose a mutually promoting multi-modal image registration and fusion framework based on commonality mining and contrastive learning, named C2RF. We adaptively decompose multi-modal images into modality-invariant common features and modality-specific unique features. Effective disentanglement not only reduces the difficulty of cross-modal registration but also facilitates purposeful information aggregation. Moreover, C2RF incorporates fusion-based contrastive learning to explicitly model the requirements of fusion on registration, which breaks the dilemma that registration and fusion are independent of each other. The aligned and misaligned fusion results act as positive and negative samples to guide registration optimization. Particularly, negative samples generated with hard negative sample mining enable our fusion results away from artifacts. Extensive experiments demonstrate that C2RF outperforms other competitors in both multi-modal image registration and fusion, notably in bolstering the robustness of image fusion to misalignment. The source code has been released at https://github.com/QinglongYan-hub/C2RF.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"218 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143832514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin
{"title":"SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting","authors":"Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin","doi":"10.1007/s11263-025-02428-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02428-0","url":null,"abstract":"<p>End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for the arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143836980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision","authors":"Hao Ai, Zidong Cao, Lin Wang","doi":"10.1007/s11263-025-02391-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02391-w","url":null,"abstract":"<p>Omnidirectional image (ODI) data is captured with a field-of-view of <span>(360^circ times 180^circ )</span>, which is much wider than the pinhole cameras and captures richer surrounding environment details than the conventional perspective images. In recent years, the availability of customer-level <span>(360^circ )</span> cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has significantly sparked its research and applications. This paper presents a systematic and comprehensive review and analysis of the recent progress of DL for omnidirectional vision. It delineates the distinct challenges and complexities encountered in applying DL to omnidirectional images as opposed to traditional perspective imagery. Our work covers four main contents: (i) A thorough introduction to the principles of omnidirectional imaging and commonly explored projections of ODI; (ii) A methodical review of varied representation learning approaches tailored for ODI; (iii) An in-depth investigation of optimization strategies specific to omnidirectional vision; (iv) A structural and hierarchical taxonomy of the DL methods for the representative omnidirectional vision tasks, from visual enhancement (<i>e</i>.<i>g</i>., image generation and super-resolution) to 3D geometry and motion estimation (<i>e</i>.<i>g</i>., depth and optical flow estimation), alongside the discussions on emergent research directions; (v) An overview of cutting-edge applications (<i>e</i>.<i>g</i>., autonomous driving and virtual reality), coupled with a critical discussion on prevailing challenges and open questions, to trigger more research in the community.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"3 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Segment Anything in 3D with Radiance Fields","authors":"Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian","doi":"10.1007/s11263-025-02421-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02421-7","url":null,"abstract":"<p>The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as <b>SA3D</b>, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a <b>single view</b>, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs <b>mask inverse rendering</b> and <b>cross-view self-prompting</b> across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement. Then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng
{"title":"AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text","authors":"Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng","doi":"10.1007/s11263-025-02423-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02423-5","url":null,"abstract":"<p>We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a generative model that yields explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio proposes to incorporate articulation modeling into the explicit mesh representation to support high-resolution rendering and avatar animation. To ensure view consistency and pose controllability of the resulting avatars, we introduce a simple-yet-effective 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text ready for animation. Furthermore, it is competent for many applications, <i>e.g.</i>, multimodal avatar animations and style-guided avatar creation. Please refer to our project page for more results.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143790143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
{"title":"Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation","authors":"Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu","doi":"10.1007/s11263-025-02412-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02412-8","url":null,"abstract":"<p>Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce <span>(text {IT}^{3}text {A})</span>, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modals, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, <span>(text {IT}^{3}text {A})</span> outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143784814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles","authors":"Craig Iaboni, Thomas Kelly, Pramod Abichandani","doi":"10.1007/s11263-025-02418-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02418-2","url":null,"abstract":"<p>This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features over 70 min of event footage acquired with a 640 <span>(times )</span> 480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and street scenes featuring busy urban environments are captured at different elevations and illumination conditions. Manual bounding box annotations of vehicles and pedestrians contained in the recordings are provided at a frequency of 30 Hz, yielding more than 93,000 labels in total. A baseline evaluation for this dataset was performed using three Spiking Neural Networks (SNNs) and ten Deep Neural Networks (DNNs). All data and Python code to voxelize the data and subsequently train SNNs/DNNs has been open-sourced.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"107 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143766924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning","authors":"Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li","doi":"10.1007/s11263-025-02419-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02419-1","url":null,"abstract":"<p>Cross-domain few-shot learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, <i>i.e.</i>, semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct intermediate domain proxies (IDP) with source feature embeddings as the <i>codebook</i> and reconstruct the target domain feature with this learned <i>codebook</i>. We then conduct an empirical study to explore the intrinsic attributes from perspectives of <i>visual styles</i> and <i>semantic contents</i> in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks. Our code and models will be publicly available.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143758240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}