Unknown Support Prototype Set for Open Set Recognition
Guosong Jiang, Pengfei Zhu, Bing Cao, Dongyue Chen, Qinghua Hu
International Journal of Computer Vision, published 2025-03-03. DOI: 10.1007/s11263-025-02384-9

Abstract: In real-world applications, visual recognition systems inevitably encounter unknown classes that are not present in the training set. Open set recognition aims to simultaneously classify samples from known classes and detect unknowns. One promising solution is to inject unknowns into the training set, and significant progress has been made on how to build an unknowns generator. However, which unknowns exhibit strong generalization is rarely explored. This work presents a new concept, Unknown Support Prototypes, which serve as good representatives of potential unknown classes. Two novel metrics, Support and Diversity, are introduced to construct the Unknown Support Prototype Set. The algorithm further constructs Unknown Support Prototypes in a semantic subspace of the feature space, which largely reduces the cardinality of the Unknown Support Prototype Set and enhances the reliability of unknown generation. Extensive experiments on several benchmark datasets demonstrate that the proposed algorithm generalizes effectively to unknowns.
{"title":"LaMD: Latent Motion Diffusion for Image-Conditional Video Generation","authors":"Yaosi Hu, Zhenzhong Chen, Chong Luo","doi":"10.1007/s11263-025-02386-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02386-7","url":null,"abstract":"<p>The video generation field has witnessed rapid improvements with the introduction of recent diffusion models. While these models have successfully enhanced appearance quality, they still face challenges in generating coherent and natural movements while efficiently sampling videos. In this paper, we propose to condense video generation into a problem of motion generation, to improve the expressiveness of motion and make video generation more manageable. This can be achieved by breaking down the video generation process into latent motion generation and video reconstruction. Specifically, we present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the motion-decomposed video autoencoder can compress patterns in movement into a concise latent motion representation. Consequently, the diffusion-based motion generator is able to efficiently generate realistic motion on a continuous latent space under multi-modal conditions, at a cost that is similar to that of image diffusion models. Results show that LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN, that encompass a variety of stochastic dynamics and highly controllable movements on multiple image-conditional video generation tasks, while significantly decreases sampling time.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143532572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LMD: Light-Weight Prediction Quality Estimation for Object Detection in Lidar Point Clouds
Tobias Riedlinger, Marius Schubert, Sarina Penquitt, Jan-Marcel Kezmann, Pascal Colling, Karsten Kahl, Lutz Roese-Koerner, Michael Arnold, Urs Zimmermann, Matthias Rottmann
International Journal of Computer Vision, published 2025-02-28. DOI: 10.1007/s11263-025-02377-8

Abstract: Object detection on lidar point cloud data is a promising technology for autonomous driving and robotics which has seen a significant rise in performance and accuracy in recent years. Uncertainty estimation in particular is a crucial component for downstream tasks, and deep neural networks remain error-prone even for predictions with high confidence. Previously proposed methods for quantifying prediction uncertainty tend to alter the training scheme of the detector or rely on prediction sampling, which vastly increases inference time. To address these two issues, we propose LidarMetaDetect (LMD), a light-weight post-processing scheme for prediction quality estimation. Our method can easily be added to any pre-trained lidar object detector without altering the base model and is purely based on post-processing, leading only to a negligible computational overhead. Our experiments show a significant increase in the statistical reliability of separating true from false predictions. We show that this improvement carries over to object detection performance when the objectness score native to the object detector is replaced. We also propose and evaluate an additional application of our method: the detection of annotation errors. Explicit samples and a conservative count of annotation error proposals indicate the viability of our method for large-scale datasets like KITTI and nuScenes. On the widely used nuScenes test dataset, 43 of our method's top 100 proposals do in fact indicate erroneous annotations.
Realistic Evaluation of Deep Active Learning for Image Classification and Semantic Segmentation
Sudhanshu Mittal, Joshua Niemeijer, Özgün Çiçek, Maxim Tatarchenko, Jan Ehrhardt, Jörg P. Schäfer, Heinz Handels, Thomas Brox
International Journal of Computer Vision, published 2025-02-28. DOI: 10.1007/s11263-025-02372-z

Abstract: Active learning aims to reduce the high labeling cost of training machine learning models on large datasets by labeling only the most informative samples. Recently, deep active learning has shown success on various tasks, but the conventional evaluation schemes are either incomplete or below par. This study critically assesses various active learning approaches and identifies the key factors for choosing the most effective method. For image classification, active learning methods improve by a large margin when integrated with data augmentation and semi-supervised learning, but barely perform better than the random baseline; we therefore evaluate them under more realistic settings and propose a more suitable evaluation protocol. For semantic segmentation, previous academic studies focused on diverse datasets with substantial annotation resources, whereas data collected in many driving scenarios is highly redundant and most medical applications are subject to very constrained annotation budgets. The study evaluates active learning techniques under varying conditions, including data redundancy, the use of semi-supervised learning, and differing annotation budgets. As an outcome, we provide a comprehensive usage guide to obtain the best performance for each case in image classification and semantic segmentation.
On the Trustworthiness Landscape of State-of-the-art Generative Models: A Survey and Outlook
Mingyuan Fan, Chengyu Wang, Cen Chen, Yang Liu, Jun Huang
International Journal of Computer Vision, published 2025-02-28. DOI: 10.1007/s11263-025-02375-w

Abstract: Diffusion models and large language models have emerged as leading-edge generative models, revolutionizing various aspects of human life. However, their practical implementation has also exposed inherent risks, bringing to light their potential downsides and sparking concerns about their trustworthiness. Despite the wealth of literature on this subject, a comprehensive survey that specifically delves into the intersection of large-scale generative models and their trustworthiness remains largely absent. To bridge this gap, this paper investigates both long-standing and emerging threats associated with these models across four fundamental dimensions: (1) privacy, (2) security, (3) fairness, and (4) responsibility. Based on our investigation results, we develop an extensive survey that outlines the trustworthiness of large generative models. Following that, we provide practical recommendations and identify promising research directions for generative AI, ultimately promoting the trustworthiness of these models and benefiting society as a whole.
Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation
Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang
International Journal of Computer Vision, published 2025-02-27. DOI: 10.1007/s11263-025-02392-9

Abstract: We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture the relationships specified in text, due to (1) a lack of effective text parsing for detailed semantic cues about body parts, and (2) incomplete modeling of the linguistic structure between words needed to comprehend the text fully. To tackle these limitations, we propose Fg-T2M++, a novel fine-grained framework that consists of (1) an LLMs semantic parsing module that extracts body part descriptions and semantics from text, (2) a hyperbolic text representation module that encodes relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a multi-modal fusion module that hierarchically fuses text and motion features. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms state-of-the-art methods, validating its ability to generate motions that accurately adhere to comprehensive text semantics.
{"title":"A Survey on Deep Stereo Matching in the Twenties","authors":"Fabio Tosi, Luca Bartolomei, Matteo Poggi","doi":"10.1007/s11263-024-02331-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02331-0","url":null,"abstract":"<p>Stereo matching is close to hitting a half-century of history, yet witnessed a rapid evolution in the last decade thanks to deep learning. While previous surveys in the late 2010s covered the first stage of this revolution, the last five years of research brought further ground-breaking advancements to the field. This paper aims to fill this gap in a two-fold manner: first, we offer an in-depth examination of the latest developments in deep stereo matching, focusing on the pioneering architectural designs and groundbreaking paradigms that have redefined the field in the 2020s; second, we present a thorough analysis of the critical challenges that have emerged alongside these advances, providing a comprehensive taxonomy of these issues and exploring the state-of-the-art techniques proposed to address them. By reviewing both the architectural innovations and the key challenges, we offer a holistic view of deep stereo matching and highlight the specific areas that require further investigation. To accompany this survey, we maintain a regularly updated project page that catalogs papers on deep stereo matching in our Awesome-Deep-Stereo-Matching repository.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143495333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu
International Journal of Computer Vision, published 2025-02-24. DOI: 10.1007/s11263-025-02349-y

Abstract: With the explosive popularity of AI-generated content (AIGC), video generation has recently received a lot of attention. Generating videos guided by text instructions poses significant challenges, such as modeling the complex relationship between space and time and the lack of large-scale text-video paired data. Existing text-video datasets are limited in both content quality and scale, or they are not open-source and hence inaccessible for study and use. For model design, previous approaches extend pretrained text-to-image generation models with temporal 1D convolution or attention modules for video generation. However, these approaches overlook the importance of jointly modeling space and time, inevitably leading to temporal distortions and misalignment between texts and videos. In this paper, we propose a novel approach that strengthens the interaction between spatial and temporal perceptions. In particular, we utilize a swapped cross-attention mechanism in 3D windows that alternates the "query" role between spatial and temporal blocks, enabling mutual reinforcement of one another. Moreover, to fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate HD-VG-130M, a large-scale, open-source video dataset comprising 130 million open-domain text-video pairs with high-definition, widescreen, and watermark-free characteristics. A smaller-scale yet more meticulously cleaned subset further enhances data quality, helping models achieve superior performance. Quantitative and qualitative experimental results demonstrate the superiority of our approach in terms of per-frame quality, temporal correlation, and text-video alignment, with clear margins.
Informative Scene Graph Generation via Debiasing
Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song
International Journal of Computer Vision, published 2025-02-24. DOI: 10.1007/s11263-025-02365-y

Abstract: Scene graph generation aims to detect visual relationship triplets (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, e.g., "on" and "at", instead of informative ones, e.g., "standing on" and "looking at". This tendency results in the loss of precise information and degrades overall performance. If a model describes an image as "stone on road" rather than "stone blocking road", the scene may be gravely misunderstood. We argue that this phenomenon is caused by two imbalances: semantic-space-level imbalance and training-sample-level imbalance. To address this problem, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components targeting these imbalances: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL). SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity-removal strategy to focus on informative predicates. Benefiting from its model-agnostic design, our method can easily be applied to SGG models and outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 across three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).
DustNet++: Deep Learning-Based Visual Regression for Dust Density Estimation
Andreas Michel, Martin Weinmann, Jannick Kuester, Faisal AlNasser, Tomas Gomez, Mark Falvey, Rainer Schmitz, Wolfgang Middelmann, Stefan Hinz
International Journal of Computer Vision, published 2025-02-24. DOI: 10.1007/s11263-025-02376-9

Abstract: Detecting airborne dust in standard RGB images presents significant challenges. Nevertheless, monitoring airborne dust holds substantial potential benefits for climate protection, environmentally sustainable construction, scientific research, and various other fields. Several hurdles must be addressed to develop an efficient and robust algorithm for airborne dust monitoring: airborne dust can be opaque or translucent, varies considerably in density, and has indistinct boundaries. Moreover, distinguishing dust from other atmospheric phenomena, such as fog or clouds, can be particularly challenging. To meet the demand for a high-performing and reliable method for monitoring airborne dust, we introduce DustNet++, a neural network designed for dust density estimation. DustNet++ leverages feature maps from multiple resolution scales and semantic levels through window and grid attention mechanisms to maintain a sparse, globally effective receptive field with linear complexity. To validate our approach, we benchmark DustNet++ against existing methods from crowd counting and monocular depth estimation on the Meteodata airborne dust dataset and the URDE binary dust segmentation dataset. Our findings demonstrate that DustNet++ surpasses comparative methods in both regression and localization capabilities.