{"title":"IPAD: Iterative, Parallel, and Diffusion-Based Network for Scene Text Recognition","authors":"Xiaomeng Yang, Zhi Qiao, Yu Zhou","doi":"10.1007/s11263-025-02443-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02443-1","url":null,"abstract":"<p>Nowadays, scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains the inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it burdens the computational load. Besides, separating linguistic knowledge from vision information may harm the final prediction. In this paper, we propose an alternative solution that uses a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize the discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on the benchmark datasets, including both Chinese and English text images.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"17 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143979615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bamboo: Building Mega-Scale Vision Dataset Continually with Human–Machine Synergy","authors":"Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He, Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu","doi":"10.1007/s11263-025-02450-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02450-2","url":null,"abstract":"<p>Large-scale datasets play a vital role in computer vision. But current datasets are annotated blindly without differentiation to samples, making the data collection inefficient and unscalable. The open question is how to build a mega-scale dataset actively. Although advanced active learning algorithms might be the answer, we experimentally found that they are lame in the realistic annotation scenario where out-of-distribution data is extensive. This work thus proposes a novel active learning framework for realistic dataset annotation. Equipped with this framework, we build a high-quality vision dataset—<b>Bamboo</b>, which consists of 69M image classification annotations with 119K categories and 28M object bounding box annotations with 809 categories. We organize these categories by a hierarchical taxonomy integrated from several knowledge bases. The classification annotations are four times larger than ImageNet22K, and that of detection is three times larger than Object365. Compared to ImageNet22K and Objects365, models pre-trained on Bamboo achieve superior performance among various downstream tasks (6.2% gains on classification and 2.1% gains on detection). We believe our active learning framework and Bamboo are essential for future work. Code and dataset are available at https://github.com/ZhangYuanhan-AI/Bamboo.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"123 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143940383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Bidirectional Bounds for Minimax-Training of Energy-Based Models","authors":"Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg","doi":"10.1007/s11263-025-02460-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02460-0","url":null,"abstract":"<p>Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate, the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143940382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Norm Regularization Training Strategy for Robust Image Quality Assessment Models","authors":"Yujia Liu, Chenxi Yang, Dingquan Li, Tingting Jiang, Tiejun Huang","doi":"10.1007/s11263-025-02458-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02458-8","url":null,"abstract":"<p>Image Quality Assessment (IQA) models predict the quality score of input images. They can be categorized into Full-Reference (FR-) and No-Reference (NR-) IQA models based on the availability of reference images. These models are essential for performance evaluation and optimization guidance in the media industry. However, researchers have observed that introducing imperceptible perturbations to input images can notably influence the predicted scores of both FR- and NR-IQA models, resulting in inaccurate assessments of image quality. This phenomenon is known as adversarial attacks. In this paper, we initially define attacks targeted at both FR-IQA and NR-IQA models. Subsequently, we introduce a defense approach applicable to both types of models, aimed at enhancing the stability of predicted scores and boosting the adversarial robustness of IQA models. To be specific, we present theoretical evidence showing that the magnitude of score changes is related to the <span>(ell _1)</span> norm of the model’s gradient with respect to the input image. Building upon this theoretical foundation, we propose a norm regularization training strategy aimed at reducing the <span>(ell _1)</span> norm of the gradient, thereby boosting the robustness of IQA models. Experiments conducted on three FR-IQA and four NR-IQA models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge, this work marks the first attempt to defend against adversarial attacks on both FR- and NR-IQA models. Our study offers valuable insights into the adversarial robustness of IQA models and provides a foundation for future research in this area.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143933579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Information Theory-Inspired Strategy for Automated Network Pruning","authors":"Xiawu Zheng, Yuexiao Ma, Teng Xi, Gang Zhang, Errui Ding, Yuchao Li, Jie Chen, Yonghong Tian, Rongrong Ji","doi":"10.1007/s11263-025-02437-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02437-z","url":null,"abstract":"<p>Despite superior performance achieved on many computer vision tasks, deep neural networks demand high computing power and memory footprint. Most existing network pruning methods require laborious human efforts and prohibitive computation resources, especially when the constraints are changed. This practically limits the application of model compression when the model needs to be deployed on a wide range of devices. Besides, existing methods are still challenged by the missing theoretical guidance, which lacks influence on the generalization error. In this paper we propose an information theory-inspired strategy for automated network pruning. The principle behind our method is the information bottleneck theory. Concretely, we introduce a new theorem to illustrate that the hidden representation should compress information with each other to achieve a better generalization. In this way, we further introduce the normalized Hilbert-Schmidt Independence Criterion on network activations as a stable and generalized indicator to construct layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method within a few seconds. We also provide rigorous proof to reveal that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression trade-offs compared to the state-of-the-art compression algorithms. For instance, on ResNet-50, we achieve a 45.3%-FLOPs reduction, with a 75.75 top-1 accuracy on ImageNet. Codes are available at https://github.com/MAC-AutoML/ITPruner.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"74 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143940232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autoregressive Temporal Modeling for Advanced Tracking-by-Diffusion","authors":"Pha Nguyen, Rishi Madhok, Bhiksha Raj, Khoa Luu","doi":"10.1007/s11263-025-02439-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02439-x","url":null,"abstract":"<p>Object tracking is a widely studied computer vision task with video and instance analysis applications. While paradigms such as <i>tracking-by-regression</i>,<i>-detection</i>,<i>-attention</i> have advanced the field, generative modeling offers new potential. Although some studies explore the generative process in instance-based understanding tasks, they rely on prediction refinement in the coordinate space rather than the visual domain. Instead, this paper presents <i>Tracking-by-Diffusion</i>, a novel paradigm for object tracking in video, leveraging visual generative models via the perspective of autoregressive models. This paradigm demonstrates broad applicability across point, box, and mask modalities while uniquely enabling textual guidance. We present DIFTracker, a framework that utilizes iterative latent variable diffusion models to redefine tracking as a next-frame reconstruction task. Our approach uniquely combines spatial and temporal dependencies in video data, offering a unified solution that encompasses existing tracking paradigms within a single Inversion-Reconstruction process. DIFTracker operates online and auto-regressively, enabling flexible instance-based video understanding. It allows us to overcome difficulties in variable-length video understanding encountered by video-inflated models and perform superior performance on seven benchmarks across five modalities. This paper not only introduces a new perspective on visual autoregressive modeling in understanding sequential visual data, specifically videos, but also provides robust theoretical validations and demonstrates broader applications in visual tracking and computer vision.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"17 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143927313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation","authors":"Jinheng Xie, Songhe Deng, Xianxu Hou, Zhaochuan Luo, Linlin Shen, Yawen Huang, Yefeng Zheng, Mike Zheng Shou","doi":"10.1007/s11263-025-02442-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02442-2","url":null,"abstract":"<p>While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Map (CAM) usually tends to activate discriminative object regions and falsely includes lots of class-related backgrounds. Without pixel-level supervisions, it could be very difficult to enlarge the foreground activation and suppress those false activation of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 datasets show that our CLIMS++ significantly outperforms the previous state-of-the-art methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"126 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143931238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving","authors":"Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li","doi":"10.1007/s11263-025-02433-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02433-3","url":null,"abstract":"<p>Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which towards interpretable risk object detection and suggestion for ego car motions. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the limited perception performance of CLIP-ViT vision encoders in existing MLLMs struggles with capturing essential visual perception information, e.g., high-resolution, multi-scale and visual-related inductive biases, which are important for autonomous driving. Addressing these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream can be made very lightweight by a well-designed P-Adapter, which is lightweight, training-efficient, and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D’s significant improvements over current MLLMs, with a <span>(3.7%)</span> in BLEU-4 for captioning and <span>(8.7%)</span> in mIoU for detection. Further tests on the Shikra-RD dataset confirm our method’s generalization capabilities. The DRAMA-ROLISP is available at https://github.com/xmed-lab/HiLM-D.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143920593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning","authors":"Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, Chao Shen","doi":"10.1007/s11263-025-02447-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02447-x","url":null,"abstract":"<p>In recent years, backdoor learning has attracted increasing attention due to its effectiveness on investigating the adversarial vulnerability of artificial intelligence (AI) systems. Several seminal backdoor attack and defense algorithms have been developed, forming an increasingly fierce arms race. However, since backdoor learning involves various factors in different stages of an AI system (e.g., data preprocessing, model training algorithm, model activation), there have been diverse settings in existing works, causing unfair comparisons or unreliable conclusions (e.g., misleading, biased, or even false conclusions). Hence, it is urgent to build a unified and standardized benchmark of backdoor learning, such that we can track real progress and design a roadmap for the future development of this literature. To that end, we construct a comprehensive benchmark of backdoor learning, dubbed <i>BackdoorBench</i>. Our benchmark makes three valuable contributions to the research community. (1) We provide an integrated implementation of representative backdoor learning algorithms (currently including 20 attack algorithms and 32 defense algorithms), based on an extensible modular-based codebase. (2) We conduct comprehensive evaluations of the implemented algorithms on 4 models and 4 datasets, leading to 11,492 pairs of attack-against-defense evaluations in total. (3) Based on above evaluations, we present abundant analysis from 10 perspectives via 23 analysis tools, and reveal several inspiring insights about backdoor learning. We hope that our efforts could build a solid foundation of backdoor learning to facilitate researchers to investigate existing algorithms, develop more innovative algorithms, and explore the intrinsic mechanism of backdoor learning. Finally, we have created a user-friendly website at https://backdoorbench.github.io/, which collects all the important information of BackdoorBench, including the link to Codebase, Docs, Leaderboard, and Model Zoo.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"115 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143910342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Paragraph-to-Image Generation with Information-Enriched Diffusion Model","authors":"Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang","doi":"10.1007/s11263-025-02435-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02435-1","url":null,"abstract":"<p>Text-to-image models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is using a large language model (<i>e.g.,</i> Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions being generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to <span>(45%)</span> human voting rate improvements for text faithfulness. Code and data can be found at: https://github.com/weijiawu/ParaDiffusion.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"99 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143910663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}