Title: Continual Test-Time Adaptation for Single Image Defocus Deblurring via Causal Siamese Networks
Authors: Shuang Cui, Yi Li, Jiangmeng Li, Xiongxin Tang, Bing Su, Fanjiang Xu, Hui Xiong
Journal: International Journal of Computer Vision. DOI: 10.1007/s11263-025-02363-0. Published: 2025-02-22.

Abstract: Single image defocus deblurring (SIDD) aims to restore an all-in-focus image from a defocused one. Distribution shifts in defocused images generally degrade the performance of existing methods during out-of-distribution inference. In this work, we identify the intrinsic cause of this degradation as the heterogeneity of lens-specific point spread functions. Empirical evidence supports this finding, motivating us to adopt a continual test-time adaptation (CTTA) paradigm for SIDD. However, traditional CTTA methods, which rely primarily on entropy minimization, cannot sufficiently explore task-dependent information for pixel-level regression tasks such as SIDD. To address this issue, we propose a novel Siamese networks-based continual test-time adaptation framework that adapts source models to continuously changing target domains in an online manner, requiring only unlabeled target data. To further mitigate the semantically erroneous textures introduced by source SIDD models under severe degradation, we revisit the learning paradigm through a structural causal model and propose Causal Siamese networks (CauSiam). Our method leverages large-scale pre-trained vision-language models to derive discriminative universal semantic priors and integrates these priors into the Siamese networks, ensuring causal identifiability between blurry inputs and restored images. Extensive experiments demonstrate that CauSiam effectively improves the generalization performance of existing SIDD methods in continuously changing domains.
{"title":"Deep Convolutional Neural Network Enhanced Non-uniform Fast Fourier Transform for Undersampled MRI Reconstruction","authors":"Yuze Li, Haikun Qi, Zhangxuan Hu, Haozhong Sun, Guangqi Li, Zhe Zhang, Yilong Liu, Hua Guo, Huijun Chen","doi":"10.1007/s11263-025-02378-7","DOIUrl":"https://doi.org/10.1007/s11263-025-02378-7","url":null,"abstract":"<p>NUFFT is widely used in MRI reconstruction, offering a balance of efficiency and accuracy. However, it struggles with uneven or sparse sampling, leading to unacceptable under sampling errors. To address this, we introduced DCNUFFT, a novel method that enhances NUFFT with deep convolutional neural network. The interpolation kernel and density compensation in inverse NUFFT were replaced with trainable neural network layers and incorporated a new global correlation prior in the spatial-frequency domain to better recover high-frequency information, enhancing reconstruction quality. DCNUFFT outperformed inverse NUFFT, iterative methods, and other deep learning approaches in terms of normalized root mean square error (NRMSE) and structural similarity index (SSIM) across various anatomies and sampling trajectories. Importantly, DCNUFFT also excelled in reconstructing under sampled PET and CT data, showing strong generalization capabilities. In subjective evaluations by radiologists, DCNUFFT scored highest in visual quality (VQ) and lesion distinguishing ability (LD), highlighting its clinical potential.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143473595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Matting and 3D Reconstruction in One Loop","authors":"Xinshuang Liu, Siqi Li, Yue Gao","doi":"10.1007/s11263-024-02341-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02341-y","url":null,"abstract":"<p>Recent 3D object reconstruction methods rely on user-input alpha mattes to remove the background and reconstruct the object, because automatically predicted alpha mattes are not accurate enough. To realize automatic 3D object reconstruction, we propose a <u>Joint</u> framework for image <u>M</u>atting and 3D object <u>R</u>econstruction (JointMR). It iteratively integrates information from all images into object hint maps to help image matting models predict better alpha mattes for each image and, in turn, improves 3D object reconstruction performance. The convergence of our framework is theoretically guaranteed. We further propose a method to convert an arbitrary image matting model into its hint-based counterpart. We conduct experiments on 3D object reconstruction from multi-view images and 3D dynamic object reconstruction from monocular videos. Different combinations of 3D object reconstruction models and image matting models are also tested. Experimental results show that our framework only slightly increases the computation cost but significantly improves the performance of all model combinations, demonstrating its compatibility and efficiency. Our code, models, and data are available at https://github.com/XinshuangL/JointMR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"50 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bootstrapping Vision-Language Models for Frequency-Centric Self-Supervised Remote Physiological Measurement","authors":"Zijie Yue, Miaojing Shi, Hanli Wang, Shuai Ding, Qijun Chen, Shanlin Yang","doi":"10.1007/s11263-025-02388-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02388-5","url":null,"abstract":"<p>Facial video-based remote physiological measurement is a promising research area for detecting human vital signs (e.g., heart rate, respiration frequency) in a non-contact way. Conventional approaches are mostly supervised learning, requiring extensive collections of facial videos and synchronously recorded photoplethysmography (PPG) signals. To tackle it, self-supervised learning has recently gained attentions; due to the lack of ground truth PPG signals, its performance is however limited. In this paper, we propose a novel frequency-centric self-supervised framework that successfully integrates the popular vision-language models (VLMs) into the remote physiological measurement task. Given a facial video, we first augment its positive and negative video samples with varying rPPG signal frequencies. Next, we introduce a frequency-oriented vision-text pair generation method by carefully creating contrastive spatio-temporal maps from positive and negative samples and designing proper text prompts to describe their relative ratios of signal frequencies. A pre-trained VLM is employed to extract features for these formed vision-text pairs and estimate rPPG signals thereafter. We develop a series of frequency-related generative and contrastive learning mechanisms to optimize the VLM, including the text-guided visual reconstruction task, the vision-text contrastive learning task, and the frequency contrastive and ranking task. Overall, our method for the first time adapts VLMs to digest and align the frequency-related knowledge in vision and text modalities. Extensive experiments on four benchmark datasets demonstrate that it significantly outperforms state of the art self-supervised methods. Our codes will be available at https://github.com/yuezijie/Bootstrapping-VLM-for-Frequency-centric-Self-supervised-Remote-Physiological-Measurement.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Imbuing, Enrichment and Calibration: Leveraging Language for Unseen Domain Extension","authors":"Chenyi Jiang, Jianqin Zhao, Jingjing Deng, Zechao Li, Haofeng Zhang","doi":"10.1007/s11263-025-02382-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02382-x","url":null,"abstract":"<p>The incorporation of language to enable model extension into unseen domains has gained significant interest in recent years. Previous methods commonly utilize semantically guided distributional shifts in training features to achieve this. Nevertheless, the intrinsic modal disparities between language and pixel-level images frequently result in a divergence within the feature manifold when employing semantic guidelines to augment features. This paper presents the <i>IMbuing, Enrichment, and Calibration (IMEC)</i> strategy as a concise solution for these issues. Unlike previous approaches, IMEC reverses the target domain style mining process to ensure the retention of semantic content within a more structured framework. Guided by global semantics, we conditionally generate style vectors for imbuing into visual features. After which IMEC introduces minor perturbations to disperse these vectors using local semantics and selectively calibrates semantic content in features through a dimensional activation strategy. IMEC integrates semantic abstract knowledge with detail image content, bridging the gap between synthetic and real samples in the target domain and mitigating content collapse resulting from semantic-visual disparities. Our model is evaluated on semantic segmentation, object detection, and image classification tasks across challenging datasets, demonstrating superior performance over existing methods in both the target and source domains. The code for IMEC is available at https://github.com/LanchJL/IMEC-ZSDE.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"65 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistent Prompt Tuning for Generalized Category Discovery","authors":"Muli Yang, Jie Yin, Yanan Gu, Cheng Deng, Hanwang Zhang, Hongyuan Zhu","doi":"10.1007/s11263-024-02343-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02343-w","url":null,"abstract":"<p>Generalized Category Discovery (GCD) aims at discovering both known and unknown classes in unlabeled data, using the knowledge learned from a limited set of labeled data. Despite today’s foundation models being trained with Internet-scale multi-modal corpus, we find that they still struggle in GCD due to the ambiguity in class definitions. In this paper, we present Consistent Prompt Tuning (CPT) to disambiguate the classes for large vision-language models (<i>e</i>.<i>g</i>., CLIP). To this end, CPT learns a set of “task + class” prompts for labeled and unlabeled data of both known and unknown classes, with the “task” tokens globally shared across classes, which contain a unified class definition pattern, <i>e</i>.<i>g</i>., “the foreground is an animal named” or “the background scene is”. These prompts are optimized with two efficient regularization techniques that encourage consistent global and local relationships between any two matched inputs. CPT is evaluated on various existing GCD benchmarks, as well as in new practical scenarios with fewer annotations and customized class definitions, demonstrating clear superiority and broad versatility over existing state-of-the-art methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143462496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Instance-Level Moving Object Segmentation from a Single Image with Events
Authors: Zhexiong Wan, Bin Fan, Le Hui, Yuchao Dai, Gim Hee Lee
Journal: International Journal of Computer Vision. DOI: 10.1007/s11263-025-02380-z. Published: 2025-02-20.

Abstract: Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects, and the difficulty lies in accounting for both spatial texture structures and temporal motion cues. Existing methods based on video frames struggle to distinguish whether pixel displacements of an object are caused by camera motion or object motion, owing to the complexity of accurate image-based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to counter the inadequate motion modeling capability of conventional images, but segmenting pixel-level object masks then becomes challenging because events lack dense texture structures. To address these two limitations imposed by unimodal settings, we propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross-modal masked attention augmentation, explicit contrastive feature learning, and flow-guided motion enhancement to exploit dense texture information from a single image and rich motion information from events, respectively. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, ablation experiments with different input settings, and real-time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future work on event-based motion. The source code with model training and pre-trained weights is released at https://npucvr.github.io/EvInsMOS.
Title: VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Authors: Jiawei Liang, Siyuan Liang, Aishan Liu, Xiaochun Cao
Journal: International Journal of Computer Vision. DOI: 10.1007/s11263-025-02368-9. Published: 2025-02-19.

Abstract: Autoregressive Visual Language Models (VLMs) demonstrate remarkable few-shot learning capabilities in a multimodal context. Recently, multimodal instruction tuning has emerged as a technique to further refine instruction-following abilities. However, we uncover the potential threat posed by backdoor attacks on autoregressive VLMs during instruction tuning. Adversaries can implant a backdoor by inserting poisoned samples, with triggers embedded in instructions or images, into the datasets, enabling malicious manipulation of the victim model's predictions with predefined triggers. However, the frozen visual encoder in autoregressive VLMs imposes constraints on learning conventional image triggers. Additionally, adversaries may lack access to the parameters and architecture of the victim model. To overcome these challenges, we introduce a multimodal instruction backdoor attack, namely VL-Trojan. Our approach facilitates image trigger learning through active reshaping of poisoned features and enhances black-box attack efficacy through an iterative character-level text trigger generation method. Our attack successfully induces the target output during inference, significantly outperforming baselines (+15.68%) in attack success rate (ASR). Furthermore, our attack demonstrates robustness across various model scales, architectures, and few-shot in-context reasoning scenarios. Our code is available at https://github.com/JWLiang007/VL-Trojan.
{"title":"VideoQA in the Era of LLMs: An Empirical Study","authors":"Junbin Xiao, Nanxin Huang, Hangyu Qin, Dongyang Li, Yicong Li, Fengbin Zhu, Zhulin Tao, Jianxing Yu, Liang Lin, Tat-Seng Chua, Angela Yao","doi":"10.1007/s11263-025-02385-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02385-8","url":null,"abstract":"<p>Video Large Language Models (Video-LLMs) are flourishing and has advanced many video-language tasks. As a golden testbed, Video Question Answering (VideoQA) plays pivotal role in Video-LLM developing. This work conducts a timely and comprehensive study of Video-LLMs’ behavior in VideoQA, aiming to elucidate their success and failure modes, and provide insights towards more human-like video understanding and question answering. Our analyses demonstrate that Video-LLMs excel in VideoQA; they can correlate contextual cues and generate plausible responses to questions about varied video contents. However, models falter in handling video temporality, both in reasoning about temporal content ordering and grounding QA-relevant temporal moments. Moreover, the models behave unintuitively - they are unresponsive to adversarial video perturbations while being sensitive to simple variations of candidate answers and questions. Also, they do not necessarily generalize better. The findings demonstrate Video-LLMs’ QA capability in standard condition yet highlight their severe deficiency in robustness and interpretability, suggesting the urgent need on rationales in Video-LLM developing.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143443339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diagnosing Human-Object Interaction Detectors","authors":"Fangrui Zhu, Yiming Xie, Weidi Xie, Huaizu Jiang","doi":"10.1007/s11263-025-02369-8","DOIUrl":"https://doi.org/10.1007/s11263-025-02369-8","url":null,"abstract":"<p>We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on <i>mAP</i> (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (<i>e.g.</i>, why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis toolbox in this paper to offer a detailed quantitative breakdown of HOI detection models, inspired by the success of object detection diagnosis tools. We first conduct a holistic investigation into the HOI detection pipeline. By defining a set of errors and using oracles to fix each one, we quantitatively analyze the significance of different errors based on the <i>mAP</i> improvement gained from fixing them. Next, we explore the two key sub-tasks of HOI detection: human-object pair localization and interaction classification. For the pair localization task, we compute the coverage of ground-truth human-object pairs and assess the noisiness of the localization results. For the classification task, we measure a model’s ability to distinguish between positive and negative detection results and to classify actual interactions when human-object pairs are correctly localized. We analyze eight state-of-the-art HOI detection models, providing valuable diagnostic insights to guide future research. For instance, our diagnosis reveals that the state-of-the-art model RLIPv2 outperforms others primarily due to its significant improvement in multi-label interaction classification accuracy. Our toolbox is applicable across various methods and datasets and is available at https://neu-vi.github.io/Diag-HOI/.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"2 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143427266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}