International Journal of Computer Vision: Latest Publications

Segment Anything in 3D with Radiance Fields
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-09 DOI: 10.1007/s11263-025-02421-7
Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian
{"title":"Segment Anything in 3D with Radiance Fields","authors":"Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian","doi":"10.1007/s11263-025-02421-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02421-7","url":null,"abstract":"<p>The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as <b>SA3D</b>, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a <b>single view</b>, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs <b>mask inverse rendering</b> and <b>cross-view self-prompting</b> across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement. Then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-07 DOI: 10.1007/s11263-025-02423-5
Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng
{"title":"AvatarStudio: High-Fidelity and Animatable 3D Avatar Creation from Text","authors":"Xuanmeng Zhang, Jianfeng Zhang, Chenxu Zhang, Jun Hao Liew, Huichao Zhang, Yi Yang, Jiashi Feng","doi":"10.1007/s11263-025-02423-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02423-5","url":null,"abstract":"<p>We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a generative model that yields explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio proposes to incorporate articulation modeling into the explicit mesh representation to support high-resolution rendering and avatar animation. To ensure view consistency and pose controllability of the resulting avatars, we introduce a simple-yet-effective 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text ready for animation. Furthermore, it is competent for many applications, <i>e.g.</i>, multimodal avatar animations and style-guided avatar creation. Please refer to our project page for more results.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143790143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-05 DOI: 10.1007/s11263-025-02412-8
Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu
{"title":"Diffusion-Enhanced Test-Time Adaptation with Text and Image Augmentation","authors":"Chun-Mei Feng, Yuanyang He, Jian Zou, Salman Khan, Huan Xiong, Zhen Li, Wangmeng Zuo, Rick Siow Mong Goh, Yong Liu","doi":"10.1007/s11263-025-02412-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02412-8","url":null,"abstract":"<p>Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce <span>(text {IT}^{3}text {A})</span>, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text with the original test data. This process allows us to filter out some spurious augmentation and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modals, we have replaced prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on the test datasets with distribution shifts and domain gaps show that in a zero-shot setting, <span>(text {IT}^{3}text {A})</span> outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143784814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-03 DOI: 10.1007/s11263-025-02418-2
Craig Iaboni, Thomas Kelly, Pramod Abichandani
{"title":"NU-AIR: A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles","authors":"Craig Iaboni, Thomas Kelly, Pramod Abichandani","doi":"10.1007/s11263-025-02418-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02418-2","url":null,"abstract":"<p>This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features over 70 min of event footage acquired with a 640 <span>(times )</span> 480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and street scenes featuring busy urban environments are captured at different elevations and illumination conditions. Manual bounding box annotations of vehicles and pedestrians contained in the recordings are provided at a frequency of 30 Hz, yielding more than 93,000 labels in total. A baseline evaluation for this dataset was performed using three Spiking Neural Networks (SNNs) and ten Deep Neural Networks (DNNs). All data and Python code to voxelize the data and subsequently train SNNs/DNNs has been open-sourced.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"107 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143766924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-01 DOI: 10.1007/s11263-025-02419-1
Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li
{"title":"Free Lunch to Meet the Gap: Intermediate Domain Reconstruction for Cross-Domain Few-Shot Learning","authors":"Tong Zhang, Yifan Zhao, Liangyu Wang, Jia Li","doi":"10.1007/s11263-025-02419-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02419-1","url":null,"abstract":"<p>Cross-domain few-shot learning (CDFSL) endeavors to transfer generalized knowledge from the source domain to target domains using only a minimal amount of training data, which faces a triplet of learning challenges in the meantime, <i>i.e.</i>, semantic disjoint, large domain discrepancy, and data scarcity. Different from predominant CDFSL works focused on generalized representations, we make novel attempts to construct intermediate domain proxies (IDP) with source feature embeddings as the <i>codebook</i> and reconstruct the target domain feature with this learned <i>codebook</i>. We then conduct an empirical study to explore the intrinsic attributes from perspectives of <i>visual styles</i> and <i>semantic contents</i> in intermediate domain proxies. Reaping benefits from these attributes of intermediate domains, we develop a fast domain alignment method to use these proxies as learning guidance for target domain feature transformation. With the collaborative learning of intermediate domain reconstruction and target feature transformation, our proposed model is able to surpass the state-of-the-art models by a margin on 8 cross-domain few-shot learning benchmarks. Our code and models will be publicly available.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143758240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

A Fast and Lightweight 3D Keypoint Detector
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-04-01 DOI: 10.1007/s11263-025-02425-3
Chengzhuan Yang, Qian Yu, Hui Wei, Fei Wu, Yunliang Jiang, Zhonglong Zheng, Ming-Hsuan Yang
{"title":"A Fast and Lightweight 3D Keypoint Detector","authors":"Chengzhuan Yang, Qian Yu, Hui Wei, Fei Wu, Yunliang Jiang, Zhonglong Zheng, Ming-Hsuan Yang","doi":"10.1007/s11263-025-02425-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02425-3","url":null,"abstract":"<p>Keypoint detection is crucial in many visual tasks, such as object recognition, shape retrieval, and 3D reconstruction, as labeling point data is labor-intensive or sometimes implausible. Nevertheless, it is challenging to quickly and accurately locate keypoints unsupervised from point clouds. This work proposes a fast and lightweight 3D keypoint detector that can efficiently and accurately detect keypoints from point clouds. Our method does not require a complex model learning process and generalizes well to new scenes. Specifically, we consider detecting keypoints a saliency detection problem for a point cloud. First, we propose a simple and effective distance measure to characterize the saliency of points in a point cloud. This distance describes geometrically essential points in the point cloud. Next, we present a regional saliency based on relative centroid distance representation that can globally characterize keypoints with regional visual information. Third, we combine geometric and semantic cues to generate a saliency map of the point cloud for determining stable 3D keypoints. We evaluate our method against existing approaches on four benchmark keypoint datasets to demonstrate its state-of-the-art performance.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"225 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143745277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Creatively Upscaling Images with Global-Regional Priors
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-03-31 DOI: 10.1007/s11263-025-02424-4
Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
{"title":"Creatively Upscaling Images with Global-Regional Priors","authors":"Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei","doi":"10.1007/s11263-025-02424-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02424-4","url":null,"abstract":"<p>Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., <span>(1024times 1024)</span>). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe of tuning-free image upscaling that pivots on global-regional priors derived from given global prompt and estimated regional prompts via Multimodal LLM. Technically, the low-frequency component of low-resolution image is recognized as global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between global prompt and each region during regional denoising, leading to regional attention prior that alleviates object repetition issue. The estimated regional prompts containing rich descriptive details further act as regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that our C-Upscale manages to generate ultra-high-resolution images (e.g., <span>(4096times 4096 ,{text {and}}, 8192times 8192)</span>) with higher visual fidelity and more creative regional details.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"16 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143737213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Advances in 3D Neural Stylization: A Survey
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-03-28 DOI: 10.1007/s11263-025-02403-9
Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung
{"title":"Advances in 3D Neural Stylization: A Survey","authors":"Yingshu Chen, Guocheng Shao, Ka Chun Shum, Binh-Son Hua, Sai-Kit Yeung","doi":"10.1007/s11263-025-02403-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02403-9","url":null,"abstract":"<p>Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural networks. We establish a taxonomy for neural stylization, considering crucial design choices such as scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then presents in-depth discussions on recent neural stylization methods for 3D data, accompanied by a benchmark evaluating selected mesh and neural field stylization methods. Based on the insights gained from the survey, we highlight the practical significance, open challenges, future research, and potential impacts of neural stylization, which facilitates researchers and practitioners to navigate the rapidly evolving landscape of 3D content creation using modern artificial intelligence.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"10 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143723537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

I²MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-03-27 DOI: 10.1007/s11263-025-02415-5
Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li
{"title":"$$hbox {I}^2$$ MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation","authors":"Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li","doi":"10.1007/s11263-025-02415-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02415-5","url":null,"abstract":"<p>Recent progresses on self-supervised 3D human action representation learning are largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, optimized with distinguishing self-augmented samples, models struggle with numerous similar positive instances in the case of limited action categories. In this work, we tackle the aforementioned problems by introducing a general Inter- and intra-modal mutual distillation (<span>(hbox {I}^2)</span>MD) framework. In <span>(hbox {I}^2)</span>MD, we first re-formulate the cross-modal interaction as a cross-modal mutual distillation (CMD) process. Different from existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD, the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the intra-modal mutual distillation (IMD) strategy, In IMD, the dynamic neighbors aggregation (DNA) mechanism is first introduced, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly-correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"215 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143723540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0

Pre-training for Action Recognition with Automatically Generated Fractal Datasets
IF 19.5 · CAS Q2 · Computer Science
International Journal of Computer Vision Pub Date : 2025-03-26 DOI: 10.1007/s11263-025-02420-8
Davyd Svyezhentsev, George Retsinas, Petros Maragos
{"title":"Pre-training for Action Recognition with Automatically Generated Fractal Datasets","authors":"Davyd Svyezhentsev, George Retsinas, Petros Maragos","doi":"10.1007/s11263-025-02420-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02420-8","url":null,"abstract":"<p>In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemmed by the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"26 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143702913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0