{"title":"A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision","authors":"Hao Ai, Zidong Cao, Lin Wang","doi":"10.1007/s11263-025-02391-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02391-w","url":null,"abstract":"<p>Omnidirectional image (ODI) data is captured with a field-of-view of <span>\\(360^\\circ \\times 180^\\circ\\)</span>, which is much wider than that of pinhole cameras, and captures richer details of the surrounding environment than conventional perspective images. In recent years, the availability of consumer-level <span>\\(360^\\circ\\)</span> cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has significantly sparked its research and applications. This paper presents a systematic and comprehensive review and analysis of the recent progress of DL for omnidirectional vision. It delineates the distinct challenges and complexities encountered in applying DL to omnidirectional images as opposed to traditional perspective imagery. 
Our work covers five main contents: (i) A thorough introduction to the principles of omnidirectional imaging and commonly explored projections of ODI; (ii) A methodical review of varied representation learning approaches tailored for ODI; (iii) An in-depth investigation of optimization strategies specific to omnidirectional vision; (iv) A structural and hierarchical taxonomy of the DL methods for the representative omnidirectional vision tasks, from visual enhancement (<i>e.g.</i>, image generation and super-resolution) to 3D geometry and motion estimation (<i>e.g.</i>, depth and optical flow estimation), alongside discussions of emergent research directions; (v) An overview of cutting-edge applications (<i>e.g.</i>, autonomous driving and virtual reality), coupled with a critical discussion of prevailing challenges and open questions, to trigger more research in the community.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"3 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Segment Anything in 3D with Radiance Fields","authors":"Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian","doi":"10.1007/s11263-025-02421-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02421-7","url":null,"abstract":"<p>The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as <b>SA3D</b>, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a <b>single view</b>, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs <b>mask inverse rendering</b> and <b>cross-view self-prompting</b> across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement. Then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. 
Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text","authors":"Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng","doi":"10.1007/s11263-025-02423-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02423-5","url":null,"abstract":"<p>We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a generative model that yields explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio proposes to incorporate articulation modeling into the explicit mesh representation to support high-resolution rendering and avatar animation. To ensure view consistency and pose controllability of the resulting avatars, we introduce a simple-yet-effective 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text ready for animation. Furthermore, it is competent for many applications, <i>e.g.</i>, multimodal avatar animations and style-guided avatar creation. 
Please refer to our project page for more results.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143790143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation","authors":"Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu","doi":"10.1007/s11263-025-02412-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02412-8","url":null,"abstract":"<p>Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce <span>\\(\\text{IT}^{3}\\text{A}\\)</span>, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modalities, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. 
Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, <span>\\(\\text{IT}^{3}\\text{A}\\)</span> outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143784814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles","authors":"Craig Iaboni, Thomas Kelly, Pramod Abichandani","doi":"10.1007/s11263-025-02418-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02418-2","url":null,"abstract":"<p>This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features over 70 min of event footage acquired with a 640 <span>\\(\\times\\)</span> 480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and street scenes featuring busy urban environments are captured at different elevations and illumination conditions. Manual bounding box annotations of vehicles and pedestrians contained in the recordings are provided at a frequency of 30 Hz, yielding more than 93,000 labels in total. A baseline evaluation for this dataset was performed using three Spiking Neural Networks (SNNs) and ten Deep Neural Networks (DNNs). All data and Python code to voxelize the data and subsequently train SNNs/DNNs have been open-sourced.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"107 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143766924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning","authors":"Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li","doi":"10.1007/s11263-025-02419-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02419-1","url":null,"abstract":"<p>Cross-domain few-shot learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, <i>i.e.</i>, semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct intermediate domain proxies (IDP) with source feature embeddings as the <i>codebook</i> and reconstruct the target domain feature with this learned <i>codebook</i>. We then conduct an empirical study to explore the intrinsic attributes from perspectives of <i>visual styles</i> and <i>semantic contents</i> in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks. 
Our code and models will be publicly available.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143758240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast and Lightweight 3D Keypoint Detector","authors":"Chengzhuan Yang, Qian Yu, Hui Wei, Fei Wu, Yunliang Jiang, Zhonglong Zheng, Ming-Hsuan Yang","doi":"10.1007/s11263-025-02425-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02425-3","url":null,"abstract":"<p>Keypoint detection is crucial in many visual tasks, such as object recognition, shape retrieval, and 3D reconstruction, as labeling point data is labor-intensive or sometimes implausible. Nevertheless, it is challenging to quickly and accurately locate keypoints unsupervised from point clouds. This work proposes a fast and lightweight 3D keypoint detector that can efficiently and accurately detect keypoints from point clouds. Our method does not require a complex model learning process and generalizes well to new scenes. Specifically, we consider detecting keypoints a saliency detection problem for a point cloud. First, we propose a simple and effective distance measure to characterize the saliency of points in a point cloud. This distance describes geometrically essential points in the point cloud. Next, we present a regional saliency based on relative centroid distance representation that can globally characterize keypoints with regional visual information. Third, we combine geometric and semantic cues to generate a saliency map of the point cloud for determining stable 3D keypoints. 
We evaluate our method against existing approaches on four benchmark keypoint datasets to demonstrate its state-of-the-art performance.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"225 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143745277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Creatively Upscaling Images with Global-Regional Priors","authors":"Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei","doi":"10.1007/s11263-025-02424-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02424-4","url":null,"abstract":"<p>Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., <span>\\(1024\\times 1024\\)</span>). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. 
Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., <span>\\(4096\\times 4096\\,\\text{and}\\,8192\\times 8192\\)</span>) with higher visual fidelity and more creative regional details.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"16 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143737213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advances in 3D Neural Stylization: A Survey","authors":"Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung","doi":"10.1007/s11263-025-02403-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02403-9","url":null,"abstract":"<p>Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural networks. We establish a taxonomy for neural stylization, considering crucial design choices such as scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then presents in-depth discussions on recent neural stylization methods for 3D data, accompanied by a benchmark evaluating selected mesh and neural field stylization methods. 
Based on the insights gained from the survey, we highlight the practical significance, open challenges, future research, and potential impacts of neural stylization, which facilitates researchers and practitioners to navigate the rapidly evolving landscape of 3D content creation using modern artificial intelligence.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"10 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143723537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"$$\\hbox{I}^2$$MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation","authors":"Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li","doi":"10.1007/s11263-025-02415-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02415-5","url":null,"abstract":"<p>Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, when optimized to distinguish self-augmented samples, models struggle with numerous similar positive instances in the case of limited action categories. In this work, we tackle the aforementioned problems by introducing a general inter- and intra-modal mutual distillation (<span>\\(\\hbox{I}^2\\)</span>MD) framework. In <span>\\(\\hbox{I}^2\\)</span>MD, we first re-formulate the cross-modal interaction as a cross-modal mutual distillation (CMD) process. Different from existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD, the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the intra-modal mutual distillation (IMD) strategy. In IMD, the dynamic neighbors aggregation (DNA) mechanism is first introduced, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly-correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. 
Extensive experiments on three datasets show that our approach sets a series of new records.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"215 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143723540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}