International Journal of Computer Vision: Latest Articles (Page 5)

Exploring Bidirectional Bounds for Minimax-Training of Energy-Based Models

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-13, DOI: 10.1007/s11263-025-02460-0

Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg

Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

Citations: 0
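
The variational lower bound mentioned in the entry above can be illustrated on a toy discrete problem. The sketch below is not the authors' method and does not implement any of their four bounds; it only checks the standard Gibbs variational inequality log Z ≥ E_q[−E(x)] + H(q) for an unnormalized density p(x) ∝ exp(−E(x)). The energy values and all function names are hypothetical.

```python
import math

def log_partition(energies):
    # Exact log Z = log sum_x exp(-E(x)) over a small discrete space
    # (computed with the usual max-shift for numerical stability).
    m = max(-e for e in energies)
    return m + math.log(sum(math.exp(-e - m) for e in energies))

def variational_lower_bound(energies, q):
    # E_q[-E(x)] + H(q), which never exceeds log Z for any distribution q.
    expected_neg_energy = sum(qi * -e for qi, e in zip(q, energies))
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected_neg_energy + entropy

def boltzmann(energies):
    # Normalized model density p(x) = exp(-E(x)) / Z.
    log_z = log_partition(energies)
    return [math.exp(-e - log_z) for e in energies]

energies = [0.3, 1.2, 2.5, 0.9]   # hypothetical toy energies
uniform = [0.25] * 4              # a poor "generator" distribution

log_z = log_partition(energies)
loose = variational_lower_bound(energies, uniform)
tight = variational_lower_bound(energies, boltzmann(energies))
print(f"log Z = {log_z:.4f}, bound (uniform q) = {loose:.4f}, bound (q = p) = {tight:.4f}")
```

The gap closes only when q matches the model density, which gives some intuition for why optimizing a one-sided bound can be fragile when the generator lags behind the energy function.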

A Norm Regularization Training Strategy for Robust Image Quality Assessment Models

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-12, DOI: 10.1007/s11263-025-02458-8

Yujia Liu, Chenxi Yang, Dingquan Li, Tingting Jiang, Tiejun Huang

Abstract: Image Quality Assessment (IQA) models predict the quality score of input images. They can be categorized into Full-Reference (FR-) and No-Reference (NR-) IQA models based on the availability of reference images. These models are essential for performance evaluation and optimization guidance in the media industry. However, researchers have observed that introducing imperceptible perturbations to input images can notably influence the predicted scores of both FR- and NR-IQA models, resulting in inaccurate assessments of image quality. Such perturbations are known as adversarial attacks. In this paper, we first define attacks targeted at both FR-IQA and NR-IQA models. We then introduce a defense approach applicable to both types of models, aimed at enhancing the stability of predicted scores and boosting the adversarial robustness of IQA models. Specifically, we present theoretical evidence showing that the magnitude of score changes is related to the $\ell_1$ norm of the model's gradient with respect to the input image. Building upon this theoretical foundation, we propose a norm regularization training strategy aimed at reducing the $\ell_1$ norm of the gradient, thereby boosting the robustness of IQA models. Experiments conducted on three FR-IQA and four NR-IQA models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge, this work marks the first attempt to defend against adversarial attacks on both FR- and NR-IQA models. Our study offers valuable insights into the adversarial robustness of IQA models and provides a foundation for future research in this area.

Citations: 0
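
The link between score changes and the ℓ1 norm of the input gradient, stated in the entry above, can be seen in a first-order argument: for a perturbation δ with ‖δ‖∞ ≤ ε, Hölder's inequality gives |∇f(x)·δ| ≤ ε‖∇f(x)‖₁. The sketch below is a toy illustration under strong assumptions (a linear scorer standing in for an IQA model, hypothetical weights and names); it is not the paper's training code. For a linear scorer the bound is attained exactly, and the penalty term shows one plausible shape of a norm-regularized loss.

```python
def l1_norm(v):
    return sum(abs(a) for a in v)

def score(w, x):
    # Toy linear quality scorer; its gradient with respect to x is simply w.
    return sum(wi * xi for wi, xi in zip(w, x))

def worst_case_perturbation(grad, eps):
    # l_inf-bounded perturbation that maximizes the first-order score change:
    # eps times the sign of each gradient component.
    return [eps * ((g > 0) - (g < 0)) for g in grad]

def regularized_loss(task_loss, grad, lam):
    # Task loss plus a penalty on the l1 norm of the input gradient
    # (lam is a hypothetical trade-off weight).
    return task_loss + lam * l1_norm(grad)

w = [0.5, -2.0, 1.5]   # hypothetical scorer weights (also the input gradient)
x = [0.2, 0.4, 0.6]    # hypothetical input "image"
eps = 0.01

delta = worst_case_perturbation(w, eps)
x_adv = [xi + di for xi, di in zip(x, delta)]
change = score(w, x_adv) - score(w, x)
bound = eps * l1_norm(w)
print(f"score change = {change:.4f}, eps * ||grad||_1 = {bound:.4f}")
```

Shrinking the ℓ1 norm of the gradient therefore shrinks the largest score change an ε-bounded perturbation can cause, at least to first order.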

An Information Theory-Inspired Strategy for Automated Network Pruning

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-12, DOI: 10.1007/s11263-025-02437-z

Xiawu Zheng, Yuexiao Ma, Teng Xi, Gang Zhang, Errui Ding, Yuchao Li, Jie Chen, Yonghong Tian, Rongrong Ji

Abstract: Despite superior performance achieved on many computer vision tasks, deep neural networks demand substantial computing power and memory. Most existing network pruning methods require laborious human effort and prohibitive computation resources, especially when the constraints change. This limits the practical application of model compression when the model needs to be deployed on a wide range of devices. Moreover, existing methods still lack theoretical guidance regarding the generalization error. In this paper, we propose an information theory-inspired strategy for automated network pruning. The principle behind our method is the information bottleneck theory. Concretely, we introduce a new theorem to illustrate that hidden representations should compress information with respect to each other to achieve better generalization. In this way, we further introduce the normalized Hilbert-Schmidt Independence Criterion (HSIC) on network activations as a stable and generalized indicator of layer importance. When a certain resource constraint is given, we integrate the HSIC indicator with the constraint to transform the architecture search problem into a linear programming problem with quadratic constraints. Such a problem is easily solved by a convex optimization method within a few seconds. We also provide a rigorous proof that optimizing the normalized HSIC simultaneously minimizes the mutual information between different layers. Without any search process, our method achieves better compression trade-offs than state-of-the-art compression algorithms. For instance, on ResNet-50, we achieve a 45.3% FLOPs reduction with 75.75% top-1 accuracy on ImageNet. Code is available at https://github.com/MAC-AutoML/ITPruner.

Citations: 0
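
The normalized HSIC indicator named in the entry above can be sketched as follows. This is a minimal illustration, assuming a linear kernel and the biased empirical estimator HSIC(K, L) = tr(KHLH)/(n−1)²; the kernel choice, the toy activations, and all function names are assumptions and are not taken from the paper or its repository.

```python
import math

def linear_gram(X):
    # Linear-kernel Gram matrix: K[i][j] = <x_i, x_j>.
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]

def center(K):
    # Double centering H K H with H = I - (1/n) 1 1^T.
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic(K, L):
    # Biased empirical HSIC: tr(K H L H) / (n - 1)^2, computed as the
    # elementwise product sum of the two centered (symmetric) Gram matrices.
    n = len(K)
    Kc, Lc = center(K), center(L)
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

def normalized_hsic(X, Y):
    # HSIC(X, Y) / sqrt(HSIC(X, X) * HSIC(Y, Y)).
    K, L = linear_gram(X), linear_gram(Y)
    return hsic(K, L) / math.sqrt(hsic(K, K) * hsic(L, L))

# Hypothetical activations of three layers for four samples (one feature each).
acts_a = [[1.0], [2.0], [3.0], [4.0]]
acts_b = [[2.0], [4.0], [6.0], [8.0]]     # a rescaled copy of acts_a
acts_c = [[1.0], [-1.0], [-1.0], [1.0]]   # linearly uncorrelated with acts_a

dependent = normalized_hsic(acts_a, acts_b)
unrelated = normalized_hsic(acts_a, acts_c)
print(f"nHSIC(a, b) = {dependent:.3f}, nHSIC(a, c) = {unrelated:.3f}")
```

The normalization makes the score invariant to rescaling of the activations, which is one reason a normalized variant is convenient when comparing layers of different widths and magnitudes.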

Autoregressive Temporal Modeling for Advanced Tracking-by-Diffusion

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-09, DOI: 10.1007/s11263-025-02439-x

Pha Nguyen, Rishi Madhok, Bhiksha Raj, Khoa Luu

Abstract: Object tracking is a widely studied computer vision task with video and instance analysis applications. While paradigms such as tracking-by-regression, -detection, and -attention have advanced the field, generative modeling offers new potential. Although some studies explore the generative process in instance-based understanding tasks, they rely on prediction refinement in the coordinate space rather than the visual domain. Instead, this paper presents Tracking-by-Diffusion, a novel paradigm for object tracking in video that leverages visual generative models from the perspective of autoregressive models. This paradigm demonstrates broad applicability across point, box, and mask modalities while uniquely enabling textual guidance. We present DIFTracker, a framework that utilizes iterative latent variable diffusion models to redefine tracking as a next-frame reconstruction task. Our approach uniquely combines spatial and temporal dependencies in video data, offering a unified solution that encompasses existing tracking paradigms within a single Inversion-Reconstruction process. DIFTracker operates online and auto-regressively, enabling flexible instance-based video understanding. It allows us to overcome difficulties in variable-length video understanding encountered by video-inflated models and to achieve superior performance on seven benchmarks across five modalities. This paper not only introduces a new perspective on visual autoregressive modeling for understanding sequential visual data, specifically videos, but also provides robust theoretical validations and demonstrates broader applications in visual tracking and computer vision.

Citations: 0

CLIMS++: Cross Language Image Matching with Automatic Context Discovery for Weakly Supervised Semantic Segmentation

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-09, DOI: 10.1007/s11263-025-02442-2

Jinheng Xie, Songhe Deng, Xianxu Hou, Zhaochuan Luo, Linlin Shen, Yawen Huang, Yefeng Zheng, Mike Zheng Shou

Abstract: While promising results have been achieved in weakly-supervised semantic segmentation (WSSS), limited supervision from image-level tags inevitably induces discriminative reliance and spurious relations between target classes and background regions. Thus, Class Activation Maps (CAMs) tend to activate only discriminative object regions and falsely include many class-related backgrounds. Without pixel-level supervision, it can be very difficult to enlarge the foreground activation and suppress the false activations of background regions. In this paper, we propose a novel framework of Cross Language Image Matching with Automatic Context Discovery (CLIMS++), based on the recently introduced Contrastive Language-Image Pre-training (CLIP) model, for WSSS. The core idea of our framework is to introduce natural language supervision to activate more complete object regions and suppress class-related background regions in CAM. In particular, we design object, background region, and text label matching losses to guide the model to excite more reasonable object regions of each category. In addition, we propose to automatically find spurious relations between foreground categories and backgrounds, through which a background suppression loss is designed to suppress the activation of class-related backgrounds. The above designs enable the proposed CLIMS++ to generate a more complete and compact activation map for the target objects. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets show that CLIMS++ significantly outperforms the previous state-of-the-art methods.

Citations: 0

HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-07, DOI: 10.1007/s11263-025-02433-3

Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li

Abstract: Recent efforts to use natural language for interpretable driving focus mainly on planning, neglecting perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), which targets interpretable risk object detection and suggestions for ego-car motion. Accurate ROLISP implementation requires extensive reasoning to identify critical traffic objects and infer their intentions, prompting us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders in existing MLLMs have limited perception performance and struggle to capture essential visual perception information, e.g., high-resolution, multi-scale, and visual-related inductive biases, which are important for autonomous driving. To address these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories rather than the semantic or appearance information (e.g., the shapes and colors) of objects. Hence, the visual process of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, receiving low-resolution dynamic video content, to capture temporal semantics, and (ii) a spatial perception stream, receiving a single high-resolution frame, to capture holistic visual perception-related information. The spatial perception stream is kept lightweight by a well-designed P-Adapter, which is training-efficient and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show HiLM-D’s significant improvements over current MLLMs, with a 3.7% improvement in BLEU-4 for captioning and 8.7% in mIoU for detection. Further tests on the Shikra-RD dataset confirm our method’s generalization capabilities. The DRAMA-ROLISP dataset is available at https://github.com/xmed-lab/HiLM-D.

Citations: 0

BackdoorBench: A Comprehensive Benchmark and Analysis of Backdoor Learning

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-06, DOI: 10.1007/s11263-025-02447-x

Baoyuan Wu, Hongrui Chen, Mingda Zhang, Zihao Zhu, Shaokui Wei, Danni Yuan, Mingli Zhu, Ruotong Wang, Li Liu, Chao Shen

Abstract: In recent years, backdoor learning has attracted increasing attention due to its effectiveness in investigating the adversarial vulnerability of artificial intelligence (AI) systems. Several seminal backdoor attack and defense algorithms have been developed, forming an increasingly fierce arms race. However, since backdoor learning involves various factors in different stages of an AI system (e.g., data preprocessing, model training algorithm, model activation), existing works use diverse settings, causing unfair comparisons or unreliable conclusions (e.g., misleading, biased, or even false conclusions). Hence, it is urgent to build a unified and standardized benchmark of backdoor learning, so that we can track real progress and design a roadmap for the future development of this literature. To that end, we construct a comprehensive benchmark of backdoor learning, dubbed BackdoorBench. Our benchmark makes three valuable contributions to the research community. (1) We provide an integrated implementation of representative backdoor learning algorithms (currently including 20 attack algorithms and 32 defense algorithms), based on an extensible modular codebase. (2) We conduct comprehensive evaluations of the implemented algorithms on 4 models and 4 datasets, leading to 11,492 pairs of attack-against-defense evaluations in total. (3) Based on the above evaluations, we present extensive analyses from 10 perspectives via 23 analysis tools, and reveal several inspiring insights about backdoor learning. We hope that our efforts build a solid foundation for backdoor learning that helps researchers investigate existing algorithms, develop more innovative algorithms, and explore the intrinsic mechanism of backdoor learning. Finally, we have created a user-friendly website at https://backdoorbench.github.io/, which collects all the important information about BackdoorBench, including links to the Codebase, Docs, Leaderboard, and Model Zoo.

Citations: 0

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-05, DOI: 10.1007/s11263-025-02435-1

Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang

Abstract: Text-to-image models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which transfers the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core, it uses a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 45% human voting rate improvements for text faithfulness. Code and data can be found at: https://github.com/weijiawu/ParaDiffusion.

Citations: 0

P2Object: Single Point Supervised Object Detection and Instance Segmentation

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-03, DOI: 10.1007/s11263-025-02441-3

Pengfei Chen, Xuehui Yu, Xumeng Han, Kuiran Wang, Guorong Li, Lingxi Xie, Zhenjun Han, Jianbin Jiao

Abstract: Object recognition using single-point supervision has attracted increasing attention recently. However, the performance gap compared with fully-supervised algorithms remains large. Previous works generated class-agnostic proposals in an image offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we introduce Point-to-Box Network (P2BNet), which constructs balanced instance-level proposal bags by generating proposals in an anchor-like way and refining the proposals in a coarse-to-fine paradigm. Through further research, we find that the bag of proposals, either at the image level or the instance level, is established on discrete box sampling. This leads the pseudo box estimation to a sub-optimal solution, resulting in the truncation of object boundaries or the excessive inclusion of background. Hence, we conduct a series of explorations of discrete-to-continuous optimization, yielding P2BNet++ and Point-to-Mask Network (P2MNet). P2BNet++ adopts an approximately continuous proposal sampling strategy by better utilizing spatial clues. P2MNet further introduces low-level image information to assist in pixel prediction, and a boundary self-prediction is designed to relieve the limitation of the estimated boxes. Benefiting from the continuous object-aware pixel-level perception, P2MNet can generate more precise bounding boxes and generalize to segmentation tasks. Our method largely surpasses the previous methods in terms of mean average precision on COCO, VOC, SBD, and Cityscapes, demonstrating great potential to bridge the performance gap compared with fully supervised tasks.

Citations: 0

Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos

IF 19.5, Zone 2 (Computer Science)

International Journal of Computer Vision, Pub Date: 2025-05-03, DOI: 10.1007/s11263-025-02429-z

Dhruv Verma, Debaditya Roy, Basura Fernando

Abstract: Situation recognition refers to the ability of an agent to identify and understand various situations or contexts based on available information and sensory inputs. It involves the cognitive process of interpreting data from the environment to determine what is happening, what factors are involved, and what actions caused those situations. This interpretation of situations is formulated as a semantic role labeling problem in computer vision-based situation recognition. Situations depicted in images and videos hold pivotal information, essential for various applications like image and video captioning, multimedia retrieval, autonomous systems, and event monitoring. However, existing methods often struggle with ambiguity and lack of context in generating meaningful and accurate predictions. Leveraging multimodal models such as CLIP, we propose ClipSitu, which sidesteps the need for full fine-tuning and achieves state-of-the-art results in situation recognition and localization tasks. ClipSitu harnesses CLIP-based image, verb, and role embeddings to predict nouns fulfilling all the roles associated with a verb, providing a comprehensive understanding of depicted scenarios. Through a cross-attention transformer, ClipSitu XTF enhances the connection between semantic role queries and visual token representations, leading to superior performance in situation recognition. We also propose a verb-wise role prediction model with near-perfect accuracy to create an end-to-end framework for producing situational summaries for out-of-domain images. We show that situational summaries empower our ClipSitu models to produce structured descriptions with reduced ambiguity compared to generic captions. Finally, we extend ClipSitu to video situation recognition to showcase its versatility, producing performance comparable to state-of-the-art methods. In summary, ClipSitu offers a robust solution to the challenge of semantic role labeling, providing a way toward structured understanding of visual media. ClipSitu advances the state of the art in situation recognition, paving the way for a more nuanced and contextually relevant understanding of visual content that could potentially yield meaningful insights about the environment that agents observe. Code is available at https://github.com/LUNAProject22/CLIPSitu.

Citations: 0