International Journal of Computer Vision: Latest Publications

Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-18 DOI: 10.1007/s11263-025-02396-5
Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger
{"title":"Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data","authors":"Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger","doi":"10.1007/s11263-025-02396-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02396-5","url":null,"abstract":"<p>Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between V and I modalities, especially under real-world conditions, where images face corruptions such as blur, noise, and weather. Despite their practical relevance, deep learning models for multimodal V-I ReID remain far less investigated than for single and cross-modal V to I settings. Moreover, state-of-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID – named Multimodal Middle Stream Fusion (MMSF) – that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing for dynamic balancing of the importance of each modality. The literature typically reports ReID performance using clean datasets, but more recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios, using data with realistic corruptions. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate the multimodal V-I ReID models that are more likely to perform well in real-world operational conditions. In particular, the proposed ML-MDA is shown as essential for a V-I person ReID system to sustain high accuracy and robustness in face of corrupted multimodal images. Our multimodal ReID models attains the best accuracy and complexity trade-off under both CL and NCL settings and compared to state-of-art unimodal ReID systems, except for the ThermalWORLD dataset due to its low-quality I. Our MMSF model outperforms every method under CL and NCL camera scenarios. GitHub code: https://github.com/art2611/MREiD-UCD-CCD.git.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"183 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-18 DOI: 10.1007/s11263-025-02405-7
Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang
{"title":"A Solution to Co-occurrence Bias in Pedestrian Attribute Recognition: Theory, Algorithms, and Improvements","authors":"Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Haotian Wu, Shiliang Pu, Hanzi Wang","doi":"10.1007/s11263-025-02405-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02405-7","url":null,"abstract":"<p>For the pedestrian attributes recognition, we demonstrate that deep models can memorize the pattern of attributes co-occurrences inherent to dataset, whether through explicit or implicit means. However, since the attributes interdependency is highly variable and unpredictable across different scenarios, the modeled attributes co-occurrences de facto serve as a data selection bias that hardly generalizes onto out-of-distribution samples. To address this thorny issue, we formulate a novel concept of attributes-disentangled feature learning, by which the mutual information among features of different attributes is minimized, ensuring the recognition of an attribute independent to the presence of others. Stemming from it, practical approaches are developed to effectively decouple attributes by suppressing the shared feature factors among attributes-specific features. As compelling merits, our method is exercised with minimal test-time computation, and is also highly extendable. With slight modifications on it, further improvements regarding better exploration of the feature space, softening the issue of imbalanced attributes distribution in dataset and flexibility in term of preserving certain causal attributes interdependencies can be achieved. Comprehensive experiments on various realistic datasets, such as PA100k, PETAzs and RAPzs, validate the efficacy and a spectrum of superiorities of our method.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"70 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143653345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-17 DOI: 10.1007/s11263-025-02409-3
Zeyu Wang, Libo Zhao, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang
{"title":"Multi-Text Guidance Is Important: Multi-Modality Image Fusion via Large Generative Vision-Language Model","authors":"Zeyu Wang, Libo Zhao, Jizheng Zhang, Rui Song, Haiyu Song, Jiana Meng, Shidong Wang","doi":"10.1007/s11263-025-02409-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02409-3","url":null,"abstract":"<p>Multi-modality image fusion aims to extract complementary features from multiple source images of different modalities, generating a fused image that inherits their advantages. To address challenges in cross-modality shared feature (CMSF) extraction, single-modality specific feature (SMSF) fusion, and the absence of ground truth (GT) images, we propose MTG-Fusion, a multi-text guided model. We leverage the capabilities of large vision-language models to generate text descriptions tailored to the input images, providing novel insights for these challenges. Our model introduces a text-guided CMSF extractor (TGCE) and a text-guided SMSF fusion module (TGSF). TGCE transforms visual features into the text domain using manifold-isometric domain transform techniques and provides effective visual-text interaction based on text-vision and text-text distances. TGSF fuses each dimension of visual features with corresponding text features, creating a weight matrix utilized for SMSF fusion. We also incorporate the constructed textual GT into the loss function for collaborative training. Extensive experiments demonstrate that MTG-Fusion achieves state-of-the-art performance on infrared and visible image fusion and medical image fusion tasks. The code is available at: https://github.com/zhaolb4080/MTG-Fusion.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Not All Pixels are Equal: Learning Pixel Hardness for Semantic Segmentation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-17 DOI: 10.1007/s11263-025-02416-4
Xin Xiao, Daiguo Zhou, Jiagao Hu, Yi Hu, Yongchao Xu
{"title":"Not All Pixels are Equal: Learning Pixel Hardness for Semantic Segmentation","authors":"Xin Xiao, Daiguo Zhou, Jiagao Hu, Yi Hu, Yongchao Xu","doi":"10.1007/s11263-025-02416-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02416-4","url":null,"abstract":"<p>Semantic segmentation has witnessed great progress. Despite the impressive overall results, the segmentation performance in some hard areas (<i>e.g.</i>, small objects or thin parts) is still not promising. A straightforward solution is hard sample mining. Yet, most existing hard pixel mining strategies for semantic segmentation often rely on pixel’s loss value, which tends to decrease during training. Intuitively, the pixel hardness for segmentation mainly depends on image structure and is expected to be stable. In this paper, we propose to learn pixel hardness for semantic segmentation by leveraging hardness information contained in global and historical loss values. More precisely, we add a gradient-independent branch for learning a hardness level (HL) map by maximizing hardness-weighted segmentation loss, which is minimized for the segmentation head. This encourages large hardness values in difficult areas, leading to appropriate and stable HL map. Despite its simplicity, the proposed method can be applied to most segmentation methods with no and marginal extra cost during inference and training, respectively. Without bells and whistles, the proposed method achieves consistent improvement (1.37% mIoU on average) over most popular semantic segmentation methods on the Cityscapes dataset, and demonstrates good generalization ability across domains. The source codes are available at this link.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"69 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143640777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Source Domain Adaptation by Causal-Guided Adaptive Multimodal Diffusion Networks
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-15 DOI: 10.1007/s11263-025-02401-x
Ziyun Cai, Yawen Huang, Tengfei Zhang, Yefeng Zheng, Dong Yue
{"title":"Multi-Source Domain Adaptation by Causal-Guided Adaptive Multimodal Diffusion Networks","authors":"Ziyun Cai, Yawen Huang, Tengfei Zhang, Yefeng Zheng, Dong Yue","doi":"10.1007/s11263-025-02401-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02401-x","url":null,"abstract":"<p>Multi-source domain adaptation (MSDA) strives to adapt the models trained on multimodal labelled source domains to an unlabelled target domain. Recent GANs based MSDA methods implicitly characterize the image distribution, which may result in limited sample fidelity, causing misalignment of pixel-level information among sources and the target. Furthermore, when samples from different sources interfere during the learning process, significant misalignment across different source domains may arise. In this paper, we propose a novel MSDA framework, called Causal-guided Adaptive Multimodal Diffusion Networks (C-AMDN), to tackle these challenges. C-AMDN incorporates a diffusive adversarial generation model for high-fidelity, efficient adaptation among source and target domains, along with deep causal inference re-weighting mechanism for the decision-making process that the conditional distributions of outcomes remain consistent across different domains, even as the input distributions change. In addition, we propose an efficient way to further adapt the input image to another domain: we preserve important semantic information by a density constraint regularization in the generation model. Experimental results demonstrate that C-AMDN significantly outperforms existing methods across several real-world domain adaptation benchmarks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"89 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143627758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Expressive Image Generation and Editing with Rich Text
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-14 DOI: 10.1007/s11263-025-02361-2
Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang
{"title":"Expressive Image Generation and Editing with Rich Text","authors":"Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang","doi":"10.1007/s11263-025-02361-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02361-2","url":null,"abstract":"<p>Plain text has become a prevalent interface for text-based image synthesis and editing. Its limited customization options, however, hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. Furthermore, describing a reference concept or texture in plain text is non-trivial. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, texture fill, footnote, and embedded image. We extract each word’s attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis with reference concepts or texture. We achieve these capabilities through a region-based diffusion process. We first obtain each word’s mask that characterizes the region guided by the word. For each region, we enforce its text attributes by creating customized prompts, applying guidance within the region, and maintaining its fidelity against plain-text generations or input images through region-based injections. We present various examples of image generation and editing from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"60 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02398-3
Haoran Duan, Shuai Shao, Bing Zhai, Tejal Shah, Jungong Han, Rajiv Ranjan
{"title":"Parameter Efficient Fine-Tuning for Multi-modal Generative Vision Models with Möbius-Inspired Transformation","authors":"Haoran Duan, Shuai Shao, Bing Zhai, Tejal Shah, Jungong Han, Rajiv Ranjan","doi":"10.1007/s11263-025-02398-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02398-3","url":null,"abstract":"<p>The rapid development of multimodal generative vision models has drawn scientific curiosity. Notable advancements, such as OpenAI’s ChatGPT and Stable Diffusion, demonstrate the potential of combining multimodal data for generative content. Nonetheless, customising these models to specific domains or tasks is challenging due to computational costs and data requirements. Conventional fine-tuning methods take redundant processing resources, motivating the development of parameter-efficient fine-tuning technologies such as adapter module, low-rank factorization and orthogonal fine-tuning. These solutions selectively change a subset of model parameters, reducing learning needs while maintaining high-quality results. Orthogonal fine-tuning, regarded as a reliable technique, preserves semantic linkages in weight space but has limitations in its expressive powers. To better overcome these constraints, we provide a simple but innovative and effective transformation method inspired by Möbius geometry, which replaces conventional orthogonal transformations in parameter-efficient fine-tuning. This strategy improved fine-tuning’s adaptability and expressiveness, allowing it to capture more data patterns. Our strategy, which is supported by theoretical understanding and empirical validation, outperforms existing approaches, demonstrating competitive improvements in generation quality for key generative tasks.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"16 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exemplar-Free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02374-x
Marco Cotogni, Fei Yang, Claudio Cusano, Andrew D. Bagdanov, Joost van de Weijer
{"title":"Exemplar-Free Continual Learning of Vision Transformers via Gated Class-Attention and Cascaded Feature Drift Compensation","authors":"Marco Cotogni, Fei Yang, Claudio Cusano, Andrew D. Bagdanov, Joost van de Weijer","doi":"10.1007/s11263-025-02374-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02374-x","url":null,"abstract":"<p>Vision transformers (ViTs) have achieved remarkable successes across a broad range of computer vision applications. As a consequence, there has been increasing interest in extending continual learning theory and techniques to ViT architectures. We propose a new method for exemplar-free class incremental training of ViTs. The main challenge of exemplar-free continual learning is maintaining plasticity of the learner without causing catastrophic forgetting of previously learned tasks. This is often achieved via exemplar replay which can help recalibrate previous task classifiers to the feature drift which occurs when learning new tasks. Exemplar replay, however, comes at the cost of retaining samples from previous tasks which for many applications may not be possible. To address the problem of continual ViT training, we first propose <i>gated class-attention</i> to minimize the drift in the final ViT transformer block. This mask-based gating is applied to class-attention mechanism of the last transformer block and strongly regulates the weights crucial for previous tasks. Importantly, gated class-attention does not require the task-ID during inference, which distinguishes it from other parameter isolation methods. Secondly, we propose a new method of <i>feature drift compensation</i> that accommodates feature drift in the backbone when learning new tasks. The combination of gated class-attention and cascaded feature drift compensation allows for plasticity towards new tasks while limiting forgetting of previous ones. Extensive experiments performed on CIFAR-100, Tiny-ImageNet and ImageNet100 demonstrate that our exemplar-free method obtains competitive results when compared to rehearsal based ViT methods.(Code:https://github.com/OcraM17/GCAB-CFDC)</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"21 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143608063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Attribute-Centric Compositional Text-to-Image Generation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-13 DOI: 10.1007/s11263-025-02371-0
Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang
{"title":"Attribute-Centric Compositional Text-to-Image Generation","authors":"Yuren Cong, Martin Renqiang Min, Li Erran Li, Bodo Rosenhahn, Michael Ying Yang","doi":"10.1007/s11263-025-02371-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02371-0","url":null,"abstract":"<p>Despite the recent impressive breakthroughs in text-to-image generation, generative models have difficulty in capturing the data distribution of underrepresented attribute compositions while over-memorizing overrepresented attribute compositions, which raises public concerns about their robustness and fairness. To tackle this challenge, we propose <b>ACTIG</b>, an attribute-centric compositional text-to-image generation framework. We present an attribute-centric feature augmentation and a novel image-free training scheme, which greatly improves model’s ability to generate images with underrepresented attributes. We further propose an attribute-centric contrastive loss to avoid overfitting to overrepresented attribute compositions. We validate our framework on the CelebA-HQ and CUB datasets. Extensive experiments show that the compositional generalization of ACTIG is outstanding, and our framework outperforms previous works in terms of image quality and text-image consistency. The source code and trained models are publicly available at https://github.com/yrcong/ACTIG.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143618568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date: 2025-03-11 DOI: 10.1007/s11263-025-02395-6
Chao Xu, Yijie Qian, Shaoting Zhu, Baigui Sun, Jian Zhao, Yong Liu, Xuelong Li
{"title":"UniFace++: Revisiting a Unified Framework for Face Reenactment and Swapping via 3D Priors","authors":"Chao Xu, Yijie Qian, Shaoting Zhu, Baigui Sun, Jian Zhao, Yong Liu, Xuelong Li","doi":"10.1007/s11263-025-02395-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02395-6","url":null,"abstract":"<p>Face reenactment and swapping share a similar pattern of identity and attribute manipulation. Our previous work UniFace has preliminarily explored establishing a unification between the two at the feature level, but it heavily relies on the accuracy of feature disentanglement, and GANs are also unstable during training. In this work, we delve into the intrinsic connections between the two from a more general training paradigm perspective, introducing a novel diffusion-based unified method UniFace++. Specifically, this work combines the advantages of each, <i>i.e.</i>, stability of reconstruction training from reenactment, simplicity and effectiveness of the target-oriented processing from swapping, and redefining both as target-oriented reconstruction tasks. In this way, face reenactment avoids complex source feature deformation and face swapping mitigates the unstable seesaw-style optimization. The core of our approach is the rendered face obtained from reassembled 3D facial priors serving as the target pivot, which contains precise geometry and coarse identity textures. We further incorporate it with the proposed Texture-Geometry-aware Diffusion Model (TGDM) to perform texture transfer under the reconstruction supervision for high-fidelity face synthesis. Extensive quantitative and qualitative experiments demonstrate the superiority of our method for both tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143599231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0