Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen
{"title":"ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning","authors":"Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen","doi":"10.1007/s11263-025-02440-4","DOIUrl":"https://doi.org/10.1007/s11263-025-02440-4","url":null,"abstract":"<p>Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL models requires substantial hardware resources, where efficiency is restricted by two key factors: the extended input sequence of the language model with vision features demands more computational operations, and a large number of additional learnable parameters increase memory complexity. These challenges significantly restrict the broader applicability of such models. To bridge this gap, we propose ADEM-VL, an efficient vision-language method that tunes VL models based on pretrained large language models (LLMs) by adopting a parameter-free cross-attention mechanism for similarity measurements in multimodal fusion. This approach only requires embedding vision features into the language space, significantly reducing the number of trainable parameters and accelerating both training and inference speeds. To enhance representation learning in fusion module, we introduce an efficient multiscale feature generation scheme that requires only a single forward pass through the vision encoder. Moreover, we propose an adaptive fusion scheme that dynamically discards less relevant visual information for each text token based on its attention score. This ensures that the fusion process prioritizes the most pertinent visual features. With experiments on various tasks including visual question answering, image captioning, and instruction-following, we demonstrate that our framework outperforms existing approaches. Specifically, our method surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset, with reduced training and inference latency, demonstrating the superiority of our framework.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"34 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143901570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang
{"title":"Few-Shot Referring Video Single- and Multi-Object Segmentation Via Cross-Modal Affinity with Instance Sequence Matching","authors":"Heng Liu, Guanghui Li, Mingqi Gao, Xiantong Zhen, Feng Zheng, Yang Wang","doi":"10.1007/s11263-025-02444-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02444-0","url":null,"abstract":"<p>Referring Video Object Segmentation (RVOS) aims to segment specific objects in videos based on the provided natural language descriptions. As a new supervised visual learning task, achieving RVOS for a given scene requires a substantial amount of annotated data. However, only minimal annotations are usually available for new scenes in realistic scenarios. Another practical problem is that, apart from a single object, multiple objects of the same category coexist in the same scene. Both of these issues may significantly reduce the performance of existing RVOS methods in handling real-world applications. In this paper, we propose a simple yet effective model to address these issues by incorporating a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module facilitates the establishment of multi-modal affinity over a limited number of samples, allowing the rapid acquisition of new semantic information while fostering the model’s adaptability to diverse scenarios. Furthermore, we extend our FS-RVOS approach to multiple objects through a new instance sequence matching module over CMA, which filters out all object trajectories with similarity to language features that exceed a matching threshold, thereby achieving few-shot referring multi-object segmentation (FS-RVMOS). To foster research in this field, we establish a new dataset based on currently available datasets, which covers many scenarios in terms of single-object and multi-object data, hence effectively simulating real-world scenes. Extensive experiments and comparative analyses underscore the exceptional performance of our proposed FS-RVOS and FS-RVMOS methods. Our method consistently outperforms existing related approaches through practical performance evaluations and robustness studies, achieving optimal performance on metrics across diverse benchmark tests.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hong-Bo Zhang, Wang-Kai Lin, Hang Su, Qing Lei, Jing-Hua Liu, Ji-Xiang Du
{"title":"Interaction Confidence Attention for Human–Object Interaction Detection","authors":"Hong-Bo Zhang, Wang-Kai Lin, Hang Su, Qing Lei, Jing-Hua Liu, Ji-Xiang Du","doi":"10.1007/s11263-025-02445-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02445-z","url":null,"abstract":"<p>In human–object interaction (HOI) detection task, ensuring that interactive pairs receive higher attention weights while reducing the weight of non-interaction pairs is imperative for enhancing HOI detection accuracy. Guiding attention learning is also a key aspect of existing transformer-based algorithms. To tackle this challenge, this study proposes a novel approach termed Interaction Confidence Score Learning Attention (ICSLA), which introduces weakening and augmentation operations into the original attention weight calculation and feature extraction processes. In ICSLA, feature learning is coupled with confidence score learning, simultaneously. Leveraging ICSLA, a new and universal decoder is devised, establishing a transformer-based one-stage HOI detection architecture. Experimental results demonstrate the effectiveness of the proposed method in improving HOI detection accuracy, offering valuable insights for further optimization of attention mechanisms.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"9 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143880835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, Pietro Perona
{"title":"A Closer Look at Benchmarking Self-supervised Pre-training with Image Classification","authors":"Markus Marks, Manuel Knott, Neehar Kondapaneni, Elijah Cole, Thijs Defraeye, Fernando Perez-Cruz, Pietro Perona","doi":"10.1007/s11263-025-02402-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02402-w","url":null,"abstract":"<p>Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data’s inherent structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In Computer Vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled data sets, and/or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. In this work, we study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types. Our study includes eleven common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization for the various protocols and evaluate how robust correlations are for different kinds of dataset domain shifts. In addition, we challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143878114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anke Tang, Li Shen, Yong Luo, Shiwei Liu, Han Hu, Bo Du, Dacheng Tao
{"title":"Data-Adaptive Weight-Ensembling for Multi-task Model Fusion","authors":"Anke Tang, Li Shen, Yong Luo, Shiwei Liu, Han Hu, Bo Du, Dacheng Tao","doi":"10.1007/s11263-025-02434-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02434-2","url":null,"abstract":"<p>Creating a multi-task model by merging models for distinct tasks has proven to be an economical and scalable approach. Recent research, like task arithmetic, demonstrates that a static solution for multi-task model fusion can be located within the vector space spanned by task vectors. However, the static nature of these methods limits their ability to adapt to the intricacies of individual instances, thereby hindering their performance in complex scenarios. To overcome this limitation, we propose a data-adaptive weight-ensembling approach that generates model weights in time. Specifically, we first feed the input samples into a hypernetwork to generate instance-specific weights for the primary model. Subsequently, we perform a functional call on the primary large model with the instance-specific weights. By generating model weights in time, the unified model gains increased flexibility and can resolve potential weight conflicts between tasks. Building upon this adaptability, our method necessitates solely the model checkpoints and unlabeled test samples using test-time adaptation training. We primarily conduct extensive experiments on vision Transformers and Flan-T5 models, demonstrating superior performance and satisfactory zero-shot transferability.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"7 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143872837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He
{"title":"P2P: Part-to-Part Motion Cues Guide a Strong Tracking Framework for LiDAR Point Clouds","authors":"Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He","doi":"10.1007/s11263-025-02430-6","DOIUrl":"https://doi.org/10.1007/s11263-025-02430-6","url":null,"abstract":"<p>3D single object tracking (SOT) methods based on appearance matching has long suffered from insufficient appearance information incurred by incomplete, textureless and semantically deficient LiDAR point clouds. While motion paradigm exploits motion cues instead of appearance matching for tracking, it incurs complex multi-stage processing and segmentation module. In this paper, we first provide in-depth explorations on motion paradigm, which proves that (<b>i</b>) it is feasible to directly infer target relative motion from point clouds across consecutive frames; (<b>ii</b>) fine-grained information comparison between consecutive point clouds facilitates target motion modeling. We thereby propose to perform part-to-part motion modeling for consecutive point clouds and introduce a novel tracking framework, termed <b>P2P</b>. The novel framework fuses each corresponding part information between consecutive point clouds, effectively exploring detailed information changes and thus modeling accurate target-related motion cues. Following this framework, we present P2P-point and P2P-voxel models, incorporating implicit and explicit part-to-part motion modeling by point- and voxel-based representation, respectively. Without bells and whistles, P2P-voxel sets a new state-of-the-art performance (<span>(sim )</span><b>89%</b>, <b>72%</b> and <b>63%</b> precision on KITTI, NuScenes and Waymo Open Dataset, respectively). Moreover, under the same point-based representation, P2P-point outperforms the previous motion tracker M<span>(^2)</span>Track by <b>3.3%</b> and <b>6.7%</b> on the KITTI and NuScenes, while running at a considerably high speed of <b>107 Fps</b> on a single RTX3090 GPU. The source code and pre-trained models are available at https://github.com/haooozi/P2P.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143853420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuhui Liu, Hong Li, Zhi Qiao, Yawen Huang, Xi Liu, Juan Zhang, Zhen Qian, Xiantong Zhen, Baochang Zhang
{"title":"D3T: Dual-Domain Diffusion Transformer in Triplanar Latent Space for 3D Incomplete-View CT Reconstruction","authors":"Xuhui Liu, Hong Li, Zhi Qiao, Yawen Huang, Xi Liu, Juan Zhang, Zhen Qian, Xiantong Zhen, Baochang Zhang","doi":"10.1007/s11263-025-02426-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02426-2","url":null,"abstract":"<p>Computed tomography (CT) is a cornerstone of clinical imaging, yet its accessibility in certain scenarios is constrained by radiation exposure concerns and operational limitations within surgical environments. CT reconstruction from incomplete views has attracted increasing research attention due to its great potential in medical applications. However, it is inherently an ill-posed problem, which, coupled with the complex, high-dimensional characteristics of 3D medical data, poses great challenges such as artifact mitigation, global incoherence, and high computational costs. To tackle those challenges, this paper introduces D3T, a new 3D conditional diffusion transformer that models 3D CT distributions in the low-dimensional 2D latent space for incomplete-view CT reconstruction. Our approach comprises two primary components: a triplanar vector quantized auto-encoder (TriVQAE) and a latent dual-domain diffusion transformer (LD3T). TriVQAE encodes high-resolution 3D CT images into compact 2D latent triplane codes which effectively factorize the intricate CT structures, further enabling compute-friendly diffusion model architecture design. Operating in the latent triplane space, LD3T significantly reduces the complexity of capturing the intricate structures in CT images. Its improved diffusion transformer architecture efficiently understands the global correlations across the three planes, ensuring high-fidelity 3D reconstructions. LD3T presents a new dual-domain conditional generation pipeline that incorporates both image and projection conditions, facilitating controllable reconstruction to produce 3D structures consistent with the given conditions. Moreover, LD3T introduces a new Dual-Space Consistency Loss that integrates image-level supervision beyond standard supervision in the latent space to enhance consistency in the 3D image space. Extensive experiments on four datasets with three inverse settings demonstrate the effectiveness of our proposal.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"74 4 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143836981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, Jiayi Ma
{"title":"C2RF: Bridging Multi-modal Image Registration and Fusion via Commonality Mining and Contrastive Learning","authors":"Linfeng Tang, Qinglong Yan, Xinyu Xiang, Leyuan Fang, Jiayi Ma","doi":"10.1007/s11263-025-02427-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02427-1","url":null,"abstract":"<p>Existing image fusion methods are typically only applicable to strictly aligned source images, and they introduce undesirable artifacts when source images are misaligned, compromising visual perception and downstream applications. In this work, we propose a mutually promoting multi-modal image registration and fusion framework based on commonality mining and contrastive learning, named C2RF. We adaptively decompose multi-modal images into modality-invariant common features and modality-specific unique features. Effective disentanglement not only reduces the difficulty of cross-modal registration but also facilitates purposeful information aggregation. Moreover, C2RF incorporates fusion-based contrastive learning to explicitly model the requirements of fusion on registration, which breaks the dilemma that registration and fusion are independent of each other. The aligned and misaligned fusion results act as positive and negative samples to guide registration optimization. Particularly, negative samples generated with hard negative sample mining enable our fusion results away from artifacts. Extensive experiments demonstrate that C2RF outperforms other competitors in both multi-modal image registration and fusion, notably in bolstering the robustness of image fusion to misalignment. The source code has been released at https://github.com/QinglongYan-hub/C2RF.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"218 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143832514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin
{"title":"SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting","authors":"Mingxin Huang, Dezhi Peng, Hongliang Li, Zhenghao Peng, Chongyu Liu, Dahua Lin, Yuliang Liu, Xiang Bai, Lianwen Jin","doi":"10.1007/s11263-025-02428-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02428-0","url":null,"abstract":"<p>End-to-end scene text spotting, which aims to read the text in natural images, has garnered significant attention in recent years. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter v2, which seeks to find a better synergy between text detection and recognition. Specifically, we enhance the relationship between two tasks using novel Recognition Conversion and Recognition Alignment modules. Recognition Conversion explicitly guides text localization through recognition loss, while Recognition Alignment dynamically extracts text features for recognition through the detection predictions. This simple yet effective design results in a concise framework that requires neither an additional rectification module nor character-level annotations for the arbitrarily-shaped text. Furthermore, the parameters of the detector are greatly reduced without performance degradation by introducing a Box Selection Schedule. Qualitative and quantitative experiments demonstrate that SwinTextSpotter v2 achieves state-of-the-art performance on various multilingual (English, Chinese, and Vietnamese) benchmarks. The code will be available at https://github.com/mxin262/SwinTextSpotterv2.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"90 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143836980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Survey of Representation Learning, Optimization Strategies, and Applications for Omnidirectional Vision","authors":"Hao Ai, Zidong Cao, Lin Wang","doi":"10.1007/s11263-025-02391-w","DOIUrl":"https://doi.org/10.1007/s11263-025-02391-w","url":null,"abstract":"<p>Omnidirectional image (ODI) data is captured with a field-of-view of <span>(360^circ times 180^circ )</span>, which is much wider than the pinhole cameras and captures richer surrounding environment details than the conventional perspective images. In recent years, the availability of customer-level <span>(360^circ )</span> cameras has made omnidirectional vision more popular, and the advance of deep learning (DL) has significantly sparked its research and applications. This paper presents a systematic and comprehensive review and analysis of the recent progress of DL for omnidirectional vision. It delineates the distinct challenges and complexities encountered in applying DL to omnidirectional images as opposed to traditional perspective imagery. Our work covers four main contents: (i) A thorough introduction to the principles of omnidirectional imaging and commonly explored projections of ODI; (ii) A methodical review of varied representation learning approaches tailored for ODI; (iii) An in-depth investigation of optimization strategies specific to omnidirectional vision; (iv) A structural and hierarchical taxonomy of the DL methods for the representative omnidirectional vision tasks, from visual enhancement (<i>e</i>.<i>g</i>., image generation and super-resolution) to 3D geometry and motion estimation (<i>e</i>.<i>g</i>., depth and optical flow estimation), alongside the discussions on emergent research directions; (v) An overview of cutting-edge applications (<i>e</i>.<i>g</i>., autonomous driving and virtual reality), coupled with a critical discussion on prevailing challenges and open questions, to trigger more research in the community.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"3 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}