International Journal of Computer Vision: Latest Articles, Page 6

$\hbox{I}^2$MD: 3D Action Representation Learning with Inter- and Intra-Modal Mutual Distillation

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-27 DOI: 10.1007/s11263-025-02415-5

Yunyao Mao, Jiajun Deng, Wengang Zhou, Zhenbo Lu, Wanli Ouyang, Houqiang Li

Abstract: Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, because they are optimized to distinguish self-augmented samples, models struggle with numerous similar positive instances when action categories are limited. In this work, we tackle these problems by introducing a general inter- and intra-modal mutual distillation ($\hbox{I}^2$MD) framework. In $\hbox{I}^2$MD, we first reformulate the cross-modal interaction as a cross-modal mutual distillation (CMD) process. Unlike existing distillation solutions that transfer the knowledge of a pre-trained and fixed teacher to the student, in CMD the knowledge is continuously updated and bidirectionally distilled between modalities during pre-training. To alleviate the interference of similar samples and exploit their underlying contexts, we further design the intra-modal mutual distillation (IMD) strategy. In IMD, the dynamic neighbors aggregation (DNA) mechanism is first introduced, where an additional cluster-level discrimination branch is instantiated in each modality. It adaptively aggregates highly correlated neighboring features, forming local cluster-level contrasting. Mutual distillation is then performed between the two branches for cross-level knowledge exchange. Extensive experiments on three datasets show that our approach sets a series of new records.

Citations: 0
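
The bidirectional distillation described in the abstract above can be illustrated with a small sketch. It assumes that each modality produces similarity scores between one sample and a shared set of anchors, and that knowledge is exchanged by matching the resulting distributions with a KL divergence. The abstract does not state the loss, the temperatures, or the function names used here, so all of them are illustrative assumptions rather than the authors' implementation.

```python
import math

def softmax(scores, temperature):
    """Turn similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_distillation_loss(sim_a, sim_b, teacher_temp=0.05, student_temp=0.1):
    """Symmetric loss in which each modality teaches the other.

    sim_a, sim_b: similarities of one sample to the same anchors, computed
    in modality A and modality B (hypothetical inputs). The temperature
    values are assumptions, not taken from the paper.
    """
    # A teaches B: sharper target from A, softer prediction from B.
    a_to_b = kl_divergence(softmax(sim_a, teacher_temp), softmax(sim_b, student_temp))
    # B teaches A: the roles are swapped.
    b_to_a = kl_divergence(softmax(sim_b, teacher_temp), softmax(sim_a, student_temp))
    return a_to_b + b_to_a
```

Because neither side is a frozen teacher, the loss is symmetric in its two arguments, which is the property that separates this setup from one-way teacher-to-student distillation.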

Pre-training for Action Recognition with Automatically Generated Fractal Datasets

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-26 DOI: 10.1007/s11263-025-02420-8

Davyd Svyezhentsev, George Retsinas, Petros Maragos

Abstract: In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, such as object classification and medical imaging. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain, applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemming from the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on the established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video.

Citations: 0
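
To give a feel for how fractal geometry can produce a short synthetic clip, here is a minimal sketch. It assumes an iterated function system (IFS) sampled with the chaos game, and a hypothetical motion model that drifts the translation terms from frame to frame. The abstract only says that fractal geometry is employed, so the choice of IFS, the Sierpinski example, and the `animate` helper are illustrative assumptions and not the released code.

```python
import random

def ifs_points(transforms, n_points, seed=0):
    """Sample points on an IFS attractor with the chaos game.

    transforms: list of affine maps (a, b, c, d, e, f) meaning
    x' = a*x + b*y + e,  y' = c*x + d*y + f.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for _ in range(n_points):
        a, b, c, d, e, f = rng.choice(transforms)
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points

def animate(transforms, n_frames, drift=0.01):
    """Yield one transform set per frame, shifting x-translations over time.

    This drift is a hypothetical motion model used only for illustration.
    """
    for t in range(n_frames):
        yield [(a, b, c, d, e + drift * t, f) for (a, b, c, d, e, f) in transforms]

# Sierpinski triangle: three contractions toward the triangle's corners.
SIERPINSKI = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]

# Eight frames, each a point set that could be rasterized into an image.
frames = [ifs_points(ts, 2000, seed=0) for ts in animate(SIERPINSKI, n_frames=8)]
```

Each frame needs only a handful of numbers (the affine coefficients), which is why such generators can produce large datasets without any collection or labeling cost.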

ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-25 DOI: 10.1007/s11263-025-02413-7

Yipeng Zhang, Xin Wang, Hong Chen, Chenyang Qin, Yibo Hao, Hong Mei, Wenwu Zhu

Abstract: With the development of diffusion models, text-to-video generation has recently received significant attention and achieved remarkable success. However, existing text-to-video approaches suffer from the following weaknesses: i) they fail to control the trajectory of the subject as well as the process of scene transformations; ii) they can only generate videos with limited frames, failing to capture the whole transformation process. To address these issues, we propose a model named ScenarioDiff, which is able to generate longer videos with scene transformations. Specifically, we employ a spatial layout fuser to control the positions of subjects and the scenes of each frame. To effectively present the process of scene transformation, we introduce a mixed frequency ControlNet, which utilizes several frames of the generated videos to extend them to long videos chunk by chunk in an auto-regressive manner. Additionally, to ensure consistency between different video chunks, we propose a cross-chunk scheduling mechanism during inference. Experimental results demonstrate the effectiveness of our approach in generating videos with dynamic scene transformations. Our project page is available at https://scenariodiff2024.github.io/.

Citations: 0
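
The chunk-by-chunk, auto-regressive extension mentioned in the abstract above can be pictured with a small planning helper. The sketch assumes that every chunk after the first is conditioned on the last few frames already generated. The function name, the parameters, and the fixed chunk length are hypothetical, since the abstract does not give these details.

```python
def plan_chunks(total_frames, chunk_len, context_len):
    """Return (context_range, new_range) pairs for chunk-by-chunk generation.

    Each chunk after the first is conditioned on the last `context_len`
    frames already generated. Ranges are half-open: [start, end).
    """
    if chunk_len <= 0 or context_len < 0:
        raise ValueError("chunk_len must be positive and context_len non-negative")
    plan = []
    start = 0
    while start < total_frames:
        end = min(start + chunk_len, total_frames)
        context = (max(0, start - context_len), start)  # empty for the first chunk
        plan.append((context, (start, end)))
        start = end
    return plan

# 40 frames, generated 16 at a time, each chunk seeing the previous 4 frames.
for context, new in plan_chunks(40, 16, 4):
    print("condition on frames", context, "-> generate frames", new)
```

A generator would then be called once per entry in the plan, which keeps the per-call frame count fixed while the video length grows.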

LaneCorrect: Self-Supervised Lane Detection

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-24 DOI: 10.1007/s11263-025-02417-3

Ming Nie, Xinyue Cai, Hang Xu, Li Zhang

Abstract: Lane detection has evolved to let highly functional autonomous driving systems understand driving scenes even in complex environments. In this paper, we work towards developing a generalized computer vision system able to detect lanes without using any annotation. We make the following contributions: (i) We illustrate how to perform unsupervised 3D lane segmentation by leveraging the distinctive intensity of lanes on the LiDAR point cloud frames, and then obtain the noisy lane labels in the 2D plane by projecting the 3D points; (ii) We propose a novel self-supervised training scheme, dubbed LaneCorrect, that automatically corrects the lane label by learning geometric consistency and instance awareness from the adversarial augmentations; (iii) With the self-supervised pre-trained model, we distill it to train a student network for arbitrary target lane detection (e.g., TuSimple) without any human labels; (iv) We thoroughly evaluate our self-supervised method on four major lane detection benchmarks (including TuSimple, CULane, CurveLanes and LLAMAS) and demonstrate excellent performance compared with existing supervised counterparts, whilst showing more effective results on alleviating the domain gap, i.e., training on CULane and testing on TuSimple.

Citations: 0
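
Contribution (i) in the abstract above, selecting lane points by LiDAR intensity and projecting them to 2D, can be sketched as follows. The sketch assumes a fixed intensity threshold and a simple pinhole camera with points already expressed in camera coordinates. The threshold, the camera parameters, and the function name are hypothetical, and the paper's actual segmentation procedure may differ.

```python
def project_lane_points(points, intensity_threshold, focal, cx, cy):
    """Keep high-intensity points and project them with a pinhole camera.

    points: iterable of (x, y, z, intensity) in camera coordinates,
    with z pointing forward. Returns a list of (u, v) pixel coordinates.
    """
    pixels = []
    for x, y, z, intensity in points:
        # Lane paint tends to reflect strongly; skip dim or behind-camera points.
        if intensity < intensity_threshold or z <= 0:
            continue
        u = focal * x / z + cx
        v = focal * y / z + cy
        pixels.append((u, v))
    return pixels
```

The projected pixels would only be noisy labels, which is why the abstract pairs this step with a separate label-correction stage.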

Camouflaged Object Detection with Adaptive Partition and Background Retrieval

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-22 DOI: 10.1007/s11263-025-02406-6

Bowen Yin, Xuying Zhang, Li Liu, Ming-Ming Cheng, Yongxiang Liu, Qibin Hou

Citations: 0

FlowSDF: Flow Matching for Medical Image Segmentation Using Distance Transforms

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-22 DOI: 10.1007/s11263-025-02373-y

Lea Bogensperger, Dominik Narnhofer, Alexander Falk, Konrad Schindler, Thomas Pock

Abstract: Medical image segmentation plays an important role in accurately identifying and isolating regions of interest within medical images. Generative approaches are particularly effective in modeling the statistical properties of segmentation masks that are closely related to the respective structures. In this work we introduce FlowSDF, an image-guided conditional flow matching framework, designed to represent the signed distance function (SDF), and, in turn, to represent an implicit distribution of segmentation masks. The advantage of leveraging the SDF is a more natural distortion when compared to that of binary masks. Through the learning of a vector field associated with the probability path of conditional SDF distributions, our framework enables accurate sampling of segmentation masks and the computation of relevant statistical measures. This probabilistic approach also facilitates the generation of uncertainty maps represented by the variance, thereby supporting enhanced robustness in prediction and further analysis. We qualitatively and quantitatively illustrate competitive performance of the proposed method on a public nuclei and gland segmentation data set, highlighting its utility in medical image segmentation applications.

Citations: 0
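
The signed distance function (SDF) representation named in the abstract above can be made concrete with a brute-force sketch. It assumes the convention that distances are negative inside the object and positive outside, and that the mask is recovered by thresholding at zero. The abstract does not state the sign convention or the distance transform used, so both are assumptions, and the quadratic-time loop is only suitable for tiny examples.

```python
import math

def signed_distance(mask):
    """Brute-force signed distance transform of a binary mask.

    Assumed convention: negative inside the object (mask == 1), positive
    outside, measured to the nearest pixel of the opposite class.
    """
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    if not inside or not outside:
        raise ValueError("mask must contain both foreground and background")
    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            others = outside if mask[r][c] else inside
            d = min(math.hypot(r - r2, c - c2) for r2, c2 in others)
            sdf[r][c] = -d if mask[r][c] else d
    return sdf

def to_mask(sdf):
    """Recover the binary mask by thresholding the SDF at zero."""
    return [[1 if v < 0 else 0 for v in row] for row in sdf]
```

Unlike a binary mask, the SDF varies smoothly with distance from the boundary, so a small perturbation of its values moves the contour slightly instead of flipping isolated pixels.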

Preconditioned Score-Based Generative Models

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-21 DOI: 10.1007/s11263-025-02410-w

Hengyuan Ma, Xiatian Zhu, Jianfeng Feng, Li Zhang

Abstract: Score-based generative models (SGMs) have recently emerged as a promising class of generative models. However, a fundamental limitation is that their sampling process is slow, because it needs many (e.g., 2000) iterations of sequential computation. An intuitive acceleration method is to reduce the number of sampling iterations, which however causes severe performance degradation. We attribute this problem to the ill-conditioned issues of the Langevin dynamics and reverse diffusion in the sampling process. Based on this insight, we propose a novel preconditioned diffusion sampling (PDS) method that leverages matrix preconditioning to alleviate the aforementioned problem. PDS alters the sampling process of a vanilla SGM at marginal extra computation cost and without model retraining. Theoretically, we prove that PDS preserves the output distribution of the SGM, with no risk of inducing systematic bias in the original sampling process. We further theoretically reveal a relation between the parameter of PDS and the sampling iterations, easing the parameter estimation under varying sampling iterations. Extensive experiments on various image datasets with a variety of resolutions and diversity validate that our PDS consistently accelerates off-the-shelf SGMs whilst maintaining the synthesis quality. In particular, PDS can accelerate sampling by up to $28\times$ on more challenging high-resolution ($1024\times1024$) image generation. Compared with the latest generative models (e.g., CLD-SGM, DDIM, and Analytic-DDIM), PDS can achieve the best sampling quality on CIFAR-10 at an FID score of 1.99. Our code is publicly available to foster further research: https://github.com/fudan-zvg/PDS.

Citations: 0
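
The idea of preconditioning an ill-conditioned Langevin sampler, as described in the abstract above, can be illustrated on a toy problem. The sketch uses the textbook preconditioned Langevin update with a constant diagonal matrix on a two-dimensional Gaussian whose variances differ by a factor of 100. The diagonal preconditioner, the toy target, and all function names are assumptions for illustration; the abstract does not describe the preconditioner that PDS actually uses.

```python
import math
import random

def langevin_step(x, score, step, precond, noise):
    """One Langevin update with a diagonal preconditioner M = diag(precond).

    x_new = x + (step / 2) * M M^T score(x) + sqrt(step) * M noise
    Setting precond to all ones gives the plain, unpreconditioned update.
    """
    s = score(x)
    return [
        xi + 0.5 * step * m * m * si + math.sqrt(step) * m * zi
        for xi, si, m, zi in zip(x, s, precond, noise)
    ]

# Toy ill-conditioned target: zero-mean Gaussian with variances 100 and 1.
VARIANCES = [100.0, 1.0]

def gaussian_score(x):
    """Gradient of the log-density of the toy Gaussian."""
    return [-xi / v for xi, v in zip(x, VARIANCES)]

def sample(precond, n_steps, step=0.5, seed=0):
    """Run the sampler from the origin and return the final state."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n_steps):
        noise = [rng.gauss(0.0, 1.0) for _ in x]
        x = langevin_step(x, gaussian_score, step, precond, noise)
    return x

# With the noise switched off, compare how fast each coordinate contracts.
plain = langevin_step([10.0, 10.0], gaussian_score, 0.5, [1.0, 1.0], [0.0, 0.0])
scaled = langevin_step([10.0, 10.0], gaussian_score, 0.5, [10.0, 1.0], [0.0, 0.0])
```

In the plain update the wide coordinate barely moves while the narrow one moves quickly, so the step size is limited by the narrow direction. With the preconditioner set to the per-coordinate standard deviations, both coordinates contract at the same rate, which is the intuition for needing fewer iterations.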

CT3D++: Improving 3D Object Detection with Keypoint-Induced Channel-wise Transformer

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-20 DOI: 10.1007/s11263-025-02404-8

Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye

Abstract: The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for advancements in performance. In this paper, our objective is to address these limitations by introducing two frameworks for 3D object detection. Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal. Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information. Additionally, CT3D++ utilizes a point-to-key bidirectional encoder for more efficient feature encoding with reduced computational cost. By replacing the corresponding components of CT3D with these novel modules, CT3D++ achieves state-of-the-art performance on both the KITTI dataset and the large-scale Waymo Open Dataset. The source code for our frameworks will be made accessible at https://github.com/hlsheng1/CT3Dplusplus.

Citations: 0

LR-ASD: Lightweight and Robust Network for Active Speaker Detection

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-19 DOI: 10.1007/s11263-025-02399-2

Junhua Liao, Haihan Duan, Kanghui Feng, Wanbing Zhao, Yanbing Yang, Liangyin Chen, Yanru Chen

Abstract: Active speaker detection is a challenging task aimed at identifying who is speaking. Due to the critical importance of this task in numerous applications, it has received considerable attention. Existing studies endeavor to enhance performance at any cost by inputting information from multiple candidates and designing complex models. While these methods have achieved excellent performance, their substantial memory and computational demands pose challenges for their application to resource-limited scenarios. Therefore, in this study, a lightweight and robust network for active speaker detection, named LR-ASD, is constructed by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, using a simple channel attention module for multi-modal feature fusion, and applying a gated recurrent unit (GRU) with low computational complexity for temporal modeling. Results on the AVA-ActiveSpeaker dataset reveal that LR-ASD achieves competitive mean Average Precision (mAP) performance (94.5% vs. 95.2%), while the resource costs are significantly lower than the state-of-the-art method, particularly in terms of model parameters (0.84 M vs. 34.33 M, approximately 41 times) and floating point operations (FLOPs) (0.51 G vs. 4.86 G, approximately 10 times). Additionally, LR-ASD demonstrates excellent robustness by achieving state-of-the-art performance on the Talkies, Columbia, and RealVAD datasets in cross-dataset testing without fine-tuning. The project is available at https://github.com/Junhua-Liao/LR-ASD.

Citations: 0
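
The efficiency ratios quoted in the abstract above can be checked with a few lines of arithmetic. The numbers are copied from the abstract; the variable names are only for illustration.

```python
# Figures quoted in the abstract: (LR-ASD, compared state-of-the-art method).
params_m = (0.84, 34.33)  # model parameters, in millions
flops_g = (0.51, 4.86)    # floating point operations, in billions
map_pct = (94.5, 95.2)    # mean Average Precision, in percent

param_ratio = params_m[1] / params_m[0]
flops_ratio = flops_g[1] / flops_g[0]
map_gap = map_pct[1] - map_pct[0]

print(f"parameters: {param_ratio:.1f}x fewer")  # 40.9x, i.e. about 41 times
print(f"FLOPs:      {flops_ratio:.1f}x fewer")  # 9.5x, i.e. about 10 times
print(f"mAP gap:    {map_gap:.1f} points")      # 0.7 points
```

The reported "approximately 41 times" and "approximately 10 times" are these ratios rounded to the nearest integer, traded against a 0.7 point drop in mAP.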

PointSea: Point Cloud Completion via Self-structure Augmentation

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision Pub Date : 2025-03-19 DOI: 10.1007/s11263-025-02400-y

Zhe Zhu, Honghua Chen, Xing He, Mingqiang Wei

Abstract: Point cloud completion is a fundamental yet not well-solved problem in 3D vision. Current approaches often rely on 3D coordinate information and/or additional data (e.g., images and scanning viewpoints) to fill in missing parts. Unlike these methods, we explore self-structure augmentation and propose PointSea for global-to-local point cloud completion. In the global stage, consider how we inspect a defective region of a physical object: we may observe it from various perspectives for a better understanding. Inspired by this, PointSea augments data representation by leveraging self-projected depth images from multiple views. To reconstruct a compact global shape from the cross-modal input, we incorporate a feature fusion module to fuse features at both intra-view and inter-view levels. In the local stage, to reveal highly detailed structures, we introduce a point generator called the self-structure dual-generator. This generator integrates both learned shape priors and geometric self-similarities for shape refinement. Unlike existing efforts that apply a unified strategy for all points, our dual-path design adapts refinement strategies conditioned on the structural type of each point, addressing the specific incompleteness of each point. Comprehensive experiments on widely-used benchmarks demonstrate that PointSea effectively understands global shapes and generates local details from incomplete input, showing clear improvements over existing methods. Our code is available at https://github.com/czvvd/SVDFormer_PointSea.

Citations: 0
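
The self-projected depth images from multiple views mentioned in the abstract above can be sketched with a tiny renderer. It assumes an orthographic camera, viewpoints evenly spaced around the y-axis, and points that lie strictly inside a sphere of radius `extent`. The projection model, the viewpoint layout, and the function names are assumptions for illustration; the abstract does not specify how PointSea renders its views.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point around the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def depth_image(points, size=8, extent=1.0):
    """Orthographic depth map seen along +z; empty pixels stay at infinity.

    Depth is measured from the image plane at z = -extent, and each pixel
    keeps the nearest point that falls into it.
    """
    img = [[math.inf] * size for _ in range(size)]
    for x, y, z in points:
        col = math.floor((x + extent) / (2 * extent) * size)
        row = math.floor((y + extent) / (2 * extent) * size)
        if 0 <= col < size and 0 <= row < size:
            img[row][col] = min(img[row][col], z + extent)
    return img

def multi_view_depth(points, n_views=3, size=8):
    """Depth maps from `n_views` viewpoints evenly spaced around the y-axis."""
    views = []
    for k in range(n_views):
        angle = 2.0 * math.pi * k / n_views
        rotated = [rotate_y(p, angle) for p in points]
        views.append(depth_image(rotated, size=size))
    return views
```

Because the depth maps are rendered from the partial point cloud itself, no extra sensor data is needed, which matches the "self-structure" framing in the abstract.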