Latest Articles from the International Journal of Computer Vision

Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-27 DOI: 10.1007/s11263-025-02392-9
Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang
{"title":"Fg-T2M++: LLMs-Augmented Fine-Grained Text Driven Human Motion Generation","authors":"Yin Wang, Mu Li, Jiapeng Liu, Zhiying Leng, Frederick W. B. Li, Ziyao Zhang, Xiaohui Liang","doi":"10.1007/s11263-025-02392-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02392-9","url":null,"abstract":"<p>We address the challenging problem of fine-grained text-driven human motion generation. Existing works generate imprecise motions that fail to accurately capture relationships specified in text due to: (1) lack of effective text parsing for detailed semantic cues regarding body parts, (2) not fully modeling linguistic structures between words to comprehend text comprehensively. To tackle these limitations, we propose a novel fine-grained framework Fg-T2M++ that consists of: (1) an <i>LLMs semantic parsing module</i> to extract body part descriptions and semantics from text, (2) a <i>hyperbolic text representation module</i> to encode relational information between text units by embedding the syntactic dependency graph into hyperbolic space, and (3) a <i>multi-modal fusion module</i> to hierarchically fuse text and motion features. Extensive experiments on HumanML3D and KIT-ML datasets demonstrate that Fg-T2M++ outperforms SOTA methods, validating its ability to accurately generate motions adhering to comprehensive text semantics.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143506920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
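To make the hyperbolic text representation module above more concrete, here is a minimal Poincaré-ball sketch in PyTorch: Euclidean token features from a dependency tree are projected onto the ball with the exponential map at the origin, and relational structure can then be compared with the hyperbolic geodesic distance. The feature sizes and the way token features are produced are assumptions for the demo, not the paper's architecture.

```python
import torch

def expmap0(v: torch.Tensor) -> torch.Tensor:
    # Exponential map at the origin of the unit Poincare ball (curvature -1):
    # projects Euclidean (tangent-space) token features onto the ball.
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.tanh(norm) * v / norm

def poincare_dist(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Geodesic distance on the Poincare ball; distances grow quickly near the
    # boundary, which is what lets hierarchical (dependency-tree) structure
    # embed with low distortion.
    diff2 = ((x - y) ** 2).sum(-1)
    denom = ((1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))).clamp_min(1e-8)
    return torch.acosh((1 + 2 * diff2 / denom).clamp_min(1.0 + 1e-7))

# Hypothetical token features for 6 words of a dependency tree, 64-d each.
word_feats = 0.1 * torch.randn(6, 64)
hyp_feats = expmap0(word_feats)
print(poincare_dist(hyp_feats[0], hyp_feats[1]).item())
```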
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-24 DOI: 10.1007/s11263-025-02349-y
Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu
{"title":"Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation","authors":"Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu","doi":"10.1007/s11263-025-02349-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02349-y","url":null,"abstract":"<p>With the explosive popularity of AI-generated content (AIGC), video generation has recently received a lot of attention. Generating videos guided by text instructions poses significant challenges, such as modeling the complex relationship between space and time, and the lack of large-scale text-video paired data. Existing text-video datasets suffer from limitations in both content quality and scale, or they are not open-source, rendering them inaccessible for study and use. For model design, previous approaches extend pretrained text-to-image generation models by adding temporal 1D convolution/attention modules for video generation. However, these approaches overlook the importance of jointly modeling space and time, inevitably leading to temporal distortions and misalignment between texts and videos. In this paper, we propose a novel approach that strengthens the interaction between spatial and temporal perceptions. In particular, we utilize a swapped cross-attention mechanism in 3D windows that alternates the “query” role between spatial and temporal blocks, enabling mutual reinforcement for each other. Moreover, to fully unlock model capabilities for high-quality video generation and promote the development of the field, we curate a large-scale and open-source video dataset called HD-VG-130M. This dataset comprises 130 million text-video pairs from the open-domain, ensuring high-definition, widescreen and watermark-free characters. A smaller-scale yet more meticulously cleaned subset further enhances the data quality, aiding models in achieving superior performance. Experimental quantitative and qualitative results demonstrate the superiority of our approach in terms of per-frame quality, temporal correlation, and text-video alignment, with clear margins.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"4 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143477342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
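As an illustration of the swapped cross-attention idea, alternating which token set plays the "query" role between the spatial and temporal blocks of a 3D window, the following PyTorch sketch runs two cross-attention passes with the roles exchanged. The window sizes, channel width, and the use of nn.MultiheadAttention are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SwapCrossAttention(nn.Module):
    """Rough sketch (not the authors' code) of swapping the "query" role
    between spatial and temporal token sets inside a 3D window."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.s_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_from_s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spatial_tok, temporal_tok):
        # spatial_tok:  [B, N_s, C] tokens from a spatial window
        # temporal_tok: [B, N_t, C] tokens from the matching temporal window
        # Pass 1: spatial tokens query the temporal ones.
        s_out, _ = self.s_from_t(spatial_tok, temporal_tok, temporal_tok)
        # Pass 2 (roles swapped): temporal tokens query the refined spatial ones.
        t_out, _ = self.t_from_s(temporal_tok, s_out, s_out)
        return s_out, t_out

x_s = torch.randn(2, 16, 64)   # hypothetical 4x4 spatial window, C=64
x_t = torch.randn(2, 8, 64)    # hypothetical 8-frame temporal window
s, t = SwapCrossAttention(64)(x_s, x_t)
print(s.shape, t.shape)
```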
Informative Scene Graph Generation via Debiasing
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-24 DOI: 10.1007/s11263-025-02365-y
Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song
{"title":"Informative Scene Graph Generation via Debiasing","authors":"Lianli Gao, Xinyu Lyu, Yuyu Guo, Yuxuan Hu, Yuan-Fang Li, Lu Xu, Heng Tao Shen, Jingkuan Song","doi":"10.1007/s11263-025-02365-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02365-y","url":null,"abstract":"<p>Scene graph generation aims to detect visual relationship triplets, (subject, predicate, object). Due to biases in data, current models tend to predict common predicates, <i>e</i>.<i>g</i>., “on” and “at”, instead of informative ones, <i>e</i>.<i>g</i>., “standing on” and “looking at”. This tendency results in the loss of precise information and overall performance. If a model only uses “stone on road” rather than “stone blocking road” to describe an image, it may be a grave misunderstanding. We argue that this phenomenon is caused by two imbalances: semantic space level imbalance and training sample level imbalance. For this problem, we propose DB-SGG, an effective framework based on debiasing but not the conventional distribution fitting. It integrates two components: Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), for these imbalances. SD utilizes a confusion matrix and a bipartite graph to construct predicate relationships. BPL adopts a random undersampling strategy and an ambiguity removing strategy to focus on informative predicates. Benefiting from the model-agnostic process, our method can be easily applied to SGG models and outperforms Transformer by <span>(136.3%)</span>, <span>(119.5%)</span>, and <span>(122.6%)</span> on mR@20 at three SGG sub-tasks on the SGG-VG dataset. Our method is further verified on another complex SGG dataset (SGG-GQA) and two downstream tasks (sentence-to-graph retrieval and image captioning).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"3 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143485948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
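The random undersampling strategy in Balanced Predicate Learning can be pictured roughly as capping the number of training triplets kept per predicate, so that head predicates such as "on" stop dominating informative tail predicates. The sketch below is a generic version of that idea; the cap value and sampling details are hypothetical.

```python
import random
from collections import defaultdict

def undersample_by_predicate(triplets, cap=1000, seed=0):
    """Keep at most `cap` training triplets per predicate, chosen at random.
    `triplets` is a list of (subject, predicate, object); `cap` is hypothetical."""
    rng = random.Random(seed)
    by_pred = defaultdict(list)
    for t in triplets:
        by_pred[t[1]].append(t)
    balanced = []
    for pred, items in by_pred.items():
        rng.shuffle(items)
        balanced.extend(items[:cap])
    return balanced

data = [("stone", "on", "road")] * 5 + [("stone", "blocking", "road")] * 2
print(undersample_by_predicate(data, cap=3))
```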
DustNet++: Deep Learning-Based Visual Regression for Dust Density Estimation
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-24 DOI: 10.1007/s11263-025-02376-9
Andreas Michel, Martin Weinmann, Jannick Kuester, Faisal AlNasser, Tomas Gomez, Mark Falvey, Rainer Schmitz, Wolfgang Middelmann, Stefan Hinz
{"title":"DustNet++: Deep Learning-Based Visual Regression for Dust Density Estimation","authors":"Andreas Michel, Martin Weinmann, Jannick Kuester, Faisal AlNasser, Tomas Gomez, Mark Falvey, Rainer Schmitz, Wolfgang Middelmann, Stefan Hinz","doi":"10.1007/s11263-025-02376-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02376-9","url":null,"abstract":"<p>Detecting airborne dust in standard RGB images presents significant challenges. Nevertheless, the monitoring of airborne dust holds substantial potential benefits for climate protection, environmentally sustainable construction, scientific research, and various other fields. To develop an efficient and robust algorithm for airborne dust monitoring, several hurdles have to be addressed. Airborne dust can be opaque or translucent, exhibit considerable variation in density, and possess indistinct boundaries. Moreover, distinguishing dust from other atmospheric phenomena, such as fog or clouds, can be particularly challenging. To meet the demand for a high-performing and reliable method for monitoring airborne dust, we introduce DustNet++, a neural network designed for dust density estimation. DustNet++ leverages feature maps from multiple resolution scales and semantic levels through window and grid attention mechanisms to maintain a sparse, globally effective receptive field with linear complexity. To validate our approach, we benchmark the performance of DustNet++ against existing methods from the domains of crowd counting and monocular depth estimation using the Meteodata airborne dust dataset and the URDE binary dust segmentation dataset. Our findings demonstrate that DustNet++ surpasses comparative methodologies in terms of regression and localization capabilities.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"56 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143477343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
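The window and grid attention mentioned above keep a sparse but globally effective receptive field with linear complexity. One common way such token groups are formed is MaxViT-style partitioning, sketched below as an assumption about the general mechanism rather than DustNet++'s exact layout: window partitioning groups local p x p patches, while grid partitioning groups strided samples that span the whole feature map.

```python
import torch

def window_partition(x: torch.Tensor, p: int) -> torch.Tensor:
    # x: [B, H, W, C] -> [B*(H//p)*(W//p), p*p, C]; attention inside each group
    # covers a local p x p window.
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, p * p, C)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    # x: [B, H, W, C] -> [B*(H//g)*(W//g), g*g, C]; each group gathers strided
    # samples spread over the whole map, giving a sparse global receptive field.
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C).permute(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, g * g, C)

x = torch.randn(1, 8, 8, 32)   # hypothetical feature map [B, H, W, C]
print(window_partition(x, 4).shape, grid_partition(x, 4).shape)
```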
Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-22 DOI: 10.1007/s11263-025-02363-0
Shuang Cui, Yi Li, Jiangmeng Li, Xiongxin Tang, Bing Su, Fanjiang Xu, Hui Xiong
{"title":"Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks","authors":"Shuang Cui, Yi Li, Jiangmeng Li, Xiongxin Tang, Bing Su, Fanjiang Xu, Hui Xiong","doi":"10.1007/s11263-025-02363-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02363-0","url":null,"abstract":"<p>Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally lead to performance degradation of existing methods during out-of-distribution inferences. In this work, we gauge the intrinsic reason behind the performance degradation, which is identified as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to employ a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which primarily rely on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks like SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework, which adapts source models to continuously changing target domains only requiring unlabeled target data in an online manner. To further mitigate semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose <i>Causal Siamese networks</i> (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"61 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143473594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
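For readers unfamiliar with continual test-time adaptation, the sketch below shows a generic Siamese (mean-teacher) consistency step on an unlabeled target batch for a pixel-level regression model. It is a simplified stand-in, not CauSiam's causal variant with VLM-derived semantic priors; the toy network, noise augmentation, and hyperparameters are assumptions.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def tta_step(student, teacher, target_batch, optimizer, ema=0.999):
    # One online adaptation step on an unlabeled target-domain batch: the
    # student sees a lightly perturbed view, the teacher the clean one, and a
    # pixel-level consistency loss adapts the student; the teacher is then
    # updated as an exponential moving average (EMA) of the student.
    noisy = target_batch + 0.01 * torch.randn_like(target_batch)
    pred_s = student(noisy)
    with torch.no_grad():
        pred_t = teacher(target_batch)
    loss = F.l1_loss(pred_s, pred_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1 - ema)
    return loss.item()

# Toy stand-in for a pretrained deblurring network.
student = nn.Conv2d(3, 3, 3, padding=1)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
for batch in [torch.rand(2, 3, 32, 32) for _ in range(3)]:  # stream of target batches
    print(tta_step(student, teacher, batch, opt))
```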
Deep Convolutional Neural Network Enhanced Non-uniform Fast Fourier Transform for Undersampled MRI Reconstruction
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-22 DOI: 10.1007/s11263-025-02378-7
Yuze Li, Haikun Qi, Zhangxuan Hu, Haozhong Sun, Guangqi Li, Zhe Zhang, Yilong Liu, Hua Guo, Huijun Chen
{"title":"Deep Convolutional Neural Network Enhanced Non-uniform Fast Fourier Transform for Undersampled MRI Reconstruction","authors":"Yuze Li, Haikun Qi, Zhangxuan Hu, Haozhong Sun, Guangqi Li, Zhe Zhang, Yilong Liu, Hua Guo, Huijun Chen","doi":"10.1007/s11263-025-02378-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02378-7","url":null,"abstract":"<p>NUFFT is widely used in MRI reconstruction, offering a balance of efficiency and accuracy. However, it struggles with uneven or sparse sampling, leading to unacceptable under sampling errors. To address this, we introduced DCNUFFT, a novel method that enhances NUFFT with deep convolutional neural network. The interpolation kernel and density compensation in inverse NUFFT were replaced with trainable neural network layers and incorporated a new global correlation prior in the spatial-frequency domain to better recover high-frequency information, enhancing reconstruction quality. DCNUFFT outperformed inverse NUFFT, iterative methods, and other deep learning approaches in terms of normalized root mean square error (NRMSE) and structural similarity index (SSIM) across various anatomies and sampling trajectories. Importantly, DCNUFFT also excelled in reconstructing under sampled PET and CT data, showing strong generalization capabilities. In subjective evaluations by radiologists, DCNUFFT scored highest in visual quality (VQ) and lesion distinguishing ability (LD), highlighting its clinical potential.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143473595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
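The core idea of replacing fixed density compensation with trainable weights can be illustrated with a toy 1-D adjoint non-uniform DFT in which each k-space sample carries a learnable weight. This is only a schematic of the concept; the paper's DCNUFFT operates on real NUFFT gridding with a CNN and a spatial-frequency global correlation prior, none of which is reproduced here.

```python
import math
import torch
import torch.nn as nn

class LearnedDensityAdjoint(nn.Module):
    # Toy 1-D adjoint non-uniform DFT in which the usual analytic density
    # compensation is replaced by a learnable per-sample weight. A real
    # implementation would use a gridding NUFFT and 2-D/3-D multi-coil data.
    def __init__(self, k_locs: torch.Tensor, n_pixels: int):
        super().__init__()
        x = torch.arange(n_pixels, dtype=torch.float32) / n_pixels - 0.5
        theta = 2 * math.pi * x[:, None] * k_locs[None, :]       # [pixels, samples]
        self.register_buffer("A_h", torch.polar(torch.ones_like(theta), theta))
        self.log_w = nn.Parameter(torch.zeros(k_locs.numel()))   # trainable compensation

    def forward(self, kdata: torch.Tensor) -> torch.Tensor:
        # kdata: [samples] complex measurements -> [pixels] complex image estimate
        weights = torch.exp(self.log_w).to(kdata.dtype)
        return self.A_h @ (weights * kdata)

k_locs = torch.sort(64 * torch.rand(128) - 32).values   # non-uniform k-space locations
model = LearnedDensityAdjoint(k_locs, n_pixels=64)
image = model(torch.randn(128, dtype=torch.complex64))
print(image.shape)   # log_w would be optimized end-to-end with the rest of the network
```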
Image Matting and 3D Reconstruction in One Loop
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-21 DOI: 10.1007/s11263-024-02341-y
Xinshuang Liu, Siqi Li, Yue Gao
{"title":"Image Matting and 3D Reconstruction in One Loop","authors":"Xinshuang Liu, Siqi Li, Yue Gao","doi":"10.1007/s11263-024-02341-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02341-y","url":null,"abstract":"<p>Recent 3D object reconstruction methods rely on user-input alpha mattes to remove the background and reconstruct the object, because automatically predicted alpha mattes are not accurate enough. To realize automatic 3D object reconstruction, we propose a <u>Joint</u> framework for image <u>M</u>atting and 3D object <u>R</u>econstruction (JointMR). It iteratively integrates information from all images into object hint maps to help image matting models predict better alpha mattes for each image and, in turn, improves 3D object reconstruction performance. The convergence of our framework is theoretically guaranteed. We further propose a method to convert an arbitrary image matting model into its hint-based counterpart. We conduct experiments on 3D object reconstruction from multi-view images and 3D dynamic object reconstruction from monocular videos. Different combinations of 3D object reconstruction models and image matting models are also tested. Experimental results show that our framework only slightly increases the computation cost but significantly improves the performance of all model combinations, demonstrating its compatibility and efficiency. Our code, models, and data are available at https://github.com/XinshuangL/JointMR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"50 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
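The alternating structure of the loop, matting conditioned on object hint maps, reconstruction from the resulting mattes, and hint maps re-rendered from the current 3D model, can be written schematically as below. The function names and the number of iterations are placeholders, not the released code.

```python
def joint_loop(images, matting_fn, recon_fn, render_hints_fn, n_iters=3):
    """Schematic of alternating matting and 3D reconstruction with hint maps."""
    hints = [None] * len(images)   # no hints available on the first pass
    model = None
    for _ in range(n_iters):
        # 1) predict an alpha matte per image, conditioned on its hint map
        mattes = [matting_fn(img, hint) for img, hint in zip(images, hints)]
        # 2) reconstruct the object from all images and current mattes
        model = recon_fn(images, mattes)
        # 3) render object hint maps from the current 3D model for the next round
        hints = [render_hints_fn(model, i) for i in range(len(images))]
    return model, mattes

# Tiny demo with string placeholders standing in for real networks/renderers.
model, mattes = joint_loop(
    images=["img0", "img1"],
    matting_fn=lambda img, hint: f"alpha({img},{hint})",
    recon_fn=lambda imgs, mts: f"object({len(imgs)} views)",
    render_hints_fn=lambda mdl, i: f"hint{i}",
)
print(model, mattes)
```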
Bootstrapping Vision-Language Models for Frequency-Centric Self-Supervised Remote Physiological Measurement
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-21 DOI: 10.1007/s11263-025-02388-5
Zijie Yue, Miaojing Shi, Hanli Wang, Shuai Ding, Qijun Chen, Shanlin Yang
{"title":"Bootstrapping Vision-Language Models for Frequency-Centric Self-Supervised Remote Physiological Measurement","authors":"Zijie Yue, Miaojing Shi, Hanli Wang, Shuai Ding, Qijun Chen, Shanlin Yang","doi":"10.1007/s11263-025-02388-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02388-5","url":null,"abstract":"<p>Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To tackle it, self-supervised learning has recently gained attentions; due to the lack of ground truth PPG signals, its performance is however limited. In this paper, we propose a novel frequency-centric self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of frequency-related generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state of the art self-supervised methods. Our codes will be available at https://github.com/yuezijie/Bootstrapping-VLM-for-Frequency-centric-Self-supervised-Remote-Physiological-Measurement.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
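A rough sketch of how a contrastive vision-text pair with a frequency-ratio prompt might be assembled is given below: a spatio-temporal map is built from a facial ROI clip, a frequency-scaled negative is created by temporal resampling, and a text prompt states their relative frequency ratio. The map construction, resampling scheme, and prompt wording are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def st_map(video_roi: np.ndarray) -> np.ndarray:
    # video_roi: [T, H, W, 3] facial ROI -> spatio-temporal map [H//8, T, 3]
    # by averaging each 8-row horizontal stripe per frame (one common map-style
    # construction; the paper's exact layout may differ).
    T, H, W, C = video_roi.shape
    stripes = video_roi.reshape(T, H // 8, 8, W, C).mean(axis=(2, 3))
    return stripes.transpose(1, 0, 2)

def freq_scaled(video_roi: np.ndarray, ratio: float) -> np.ndarray:
    # Nearest-neighbour temporal resampling so the apparent physiological
    # frequency is scaled by `ratio` (a simplified augmentation).
    T = video_roi.shape[0]
    idx = np.clip((np.arange(T) * ratio).astype(int), 0, T - 1)
    return video_roi[idx]

video = np.random.rand(160, 64, 64, 3).astype(np.float32)   # hypothetical ROI clip
pos_map, neg_map = st_map(video), st_map(freq_scaled(video, 1.5))
prompt = ("the signal frequency of the second map is one and a half times "
          "that of the first map")                           # hypothetical wording
print(pos_map.shape, neg_map.shape, prompt)
```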
Instance-Level Moving Object Segmentation from a Single Image with Events
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-20 DOI: 10.1007/s11263-025-02380-z
Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee
{"title":"Instance-Level Moving Object Segmentation from a Single Image with Events","authors":"Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee","doi":"10.1007/s11263-025-02380-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02380-z","url":null,"abstract":"<p>Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects, while the difficulties lie in taking into account both spatial texture structures and temporal motion cues. Existing methods based on video frames encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion due to the complexities of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images’ inadequate motion modeling capabilities, but instead lead to challenges in segmenting pixel-level object masks due to the lack of dense texture structures in events. To address these two limitations imposed by unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events, respectively. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, as well as ablation experiments with different input settings and real-time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future work in event-based motion related works. The source code with model training and pre-trained weights is released at https://npucvr.github.io/EvInsMOS.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143451612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
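Event streams are typically converted into a dense representation before being fused with image features. The sketch below accumulates (x, y, t, polarity) events into a signed voxel grid, which is one common such encoding; it is shown only as an assumption about the preprocessing, not the paper's exact event representation.

```python
import torch

def events_to_voxel(events: torch.Tensor, bins: int, H: int, W: int) -> torch.Tensor:
    # events: [N, 4] rows of (x, y, timestamp, polarity in {0, 1}).
    # Accumulates signed event counts into a [bins, H, W] voxel grid, a common
    # dense event representation fed alongside the image.
    x, y, t, p = events.unbind(dim=1)
    t = (t - t.min()) / (t.max() - t.min() + 1e-9)     # normalise timestamps to [0, 1)
    b = (t * bins).clamp(max=bins - 1).long()          # temporal bin per event
    vox = torch.zeros(bins, H, W)
    flat_idx = b * H * W + y.long() * W + x.long()
    vox.view(-1).index_add_(0, flat_idx, p * 2 - 1)    # +1 for ON events, -1 for OFF
    return vox

ev = torch.stack([torch.randint(0, 64, (1000,)).float(),   # x
                  torch.randint(0, 48, (1000,)).float(),   # y
                  torch.rand(1000),                        # timestamp
                  torch.randint(0, 2, (1000,)).float()],   # polarity
                 dim=1)
print(events_to_voxel(ev, bins=5, H=48, W=64).shape)   # torch.Size([5, 48, 64])
```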
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
IF 19.5, CAS Zone 2, Computer Science
International Journal of Computer Vision Pub Date : 2025-02-19 DOI: 10.1007/s11263-025-02368-9
Jiawei Liang, Siyuan Liang, Aishan Liu, Xiaochun Cao
{"title":"VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models","authors":"Jiawei Liang, Siyuan Liang, Aishan Liu, Xiaochun Cao","doi":"10.1007/s11263-025-02368-9","DOIUrl":"https://doi.org/10.1007/s11263-025-02368-9","url":null,"abstract":"<p>Autoregressive Visual Language Models (VLMs) demonstrate remarkable few-shot learning capabilities within a multimodal context. Recently, multimodal instruction tuning has emerged as a technique to further refine instruction-following abilities. However, we uncover the potential threat posed by backdoor attacks on autoregressive VLMs during instruction tuning. Adversaries can implant a backdoor by inserting poisoned samples with triggers embedded in instructions or images to datasets, enabling malicious manipulation of the victim model’s predictions with predefined triggers. However, the frozen visual encoder in autoregressive VLMs imposes constraints on learning conventional image triggers. Additionally, adversaries may lack access to the parameters and architectures of the victim model. To overcome these challenges, we introduce a multimodal instruction backdoor attack, namely VL-Trojan. Our approach facilitates image trigger learning through active reshaping of poisoned features and enhances black-box attack efficacy through an iterative character-level text trigger generation method. Our attack successfully induces target output during inference, significantly outperforming baselines (+15.68%) in ASR. Furthermore, our attack demonstrates robustness across various model scales, architectures and few-shot in-context reasoning scenarios. Our codes are available at https://github.com/JWLiang007/VL-Trojan.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"49 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143443340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
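The attack success rate (ASR) quoted above is simply the fraction of responses to trigger-stamped inputs that contain the adversary's target output. A minimal evaluation helper might look like the following, with a purely illustrative target string.

```python
def attack_success_rate(responses, target="banana"):
    # Fraction of responses to trigger-stamped inputs that contain the
    # adversary's target string (the target word here is purely illustrative).
    hits = sum(target.lower() in r.lower() for r in responses)
    return hits / max(len(responses), 1)

print(attack_success_rate(["a banana on a table", "a dog on grass", "Banana!"]))  # ~0.667
```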