SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition
Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang
International Journal of Computer Vision, published 2025-01-25. DOI: https://doi.org/10.1007/s11263-025-02345-2

Abstract: Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has recently been significantly revolutionized. However, the computational cost and memory requirements render these methods unsuitable for mobile devices. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer), for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial attention and detail enhancement, which can further be used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobile-friendly rivals and the Transformer-based counterparts with better performance and lower latency, without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique that further reduces the inference latency of the proposed framework. Beyond semantic segmentation, we apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating its potential to serve as a versatile, mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.

MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions
David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
International Journal of Computer Vision, published 2025-01-24. DOI: https://doi.org/10.1007/s11263-025-02346-1

Abstract: Current video diffusion models (VDMs) mostly rely on text conditions, limiting control over video appearance and geometry. This study introduces a new model, MoonShot, that conditions on both image and text for enhanced control. It features the Multimodal Video Block (MVB), which integrates a motion-aware dual cross-attention layer for precise appearance and motion alignment with the provided prompts, and a spatiotemporal attention layer for large motion dynamics. It can also incorporate pre-trained Image ControlNet modules for geometry conditioning without extra video training. Experiments show that our model significantly improves visual quality and motion fidelity, and its versatility allows for applications in personalized video generation, animation, and editing, making it a foundational tool for controllable video creation.

{"title":"DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection","authors":"Rui Shao, Tianxing Wu, Liqiang Nie, Ziwei Liu","doi":"10.1007/s11263-024-02274-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02274-6","url":null,"abstract":"<p>Existing deepfake detection methods fail to generalize well to unseen or degraded samples, which can be attributed to the over-fitting of low-level forgery patterns. Here we argue that high-level semantics are also indispensable recipes for generalizable forgery detection. Recently, large pre-trained Vision Transformers (ViTs) have shown promising generalization capability. In this paper, we propose the first parameter-efficient tuning approach for deepfake detection, namely <b>DeepFake-Adapter</b>, to effectively and efficiently adapt the generalizable high-level semantics from large pre-trained ViTs to aid deepfake detection. Given large pre-trained models but limited deepfake data, DeepFake-Adapter introduces lightweight yet dedicated dual-level adapter modules to a ViT while keeping the model backbone frozen. Specifically, to guide the adaptation process to be aware of both global and local forgery cues of deepfake data, <b>1)</b> we not only insert <b>Globally-aware Bottleneck Adapters</b> in parallel to MLP layers of ViT, <b>2)</b> but also actively cross-attend <b>Locally-aware Spatial Adapters</b> with features from ViT. Unlike existing deepfake detection methods merely focusing on low-level forgery patterns, the forgery detection process of our model can be regularized by generalizable high-level semantics from a pre-trained ViT and adapted by global and local low-level forgeries of deepfake data. Extensive experiments on several standard deepfake detection benchmarks validate the effectiveness of our approach. Notably, DeepFake-Adapter demonstrates a convincing advantage under cross-dataset and cross-manipulation settings.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"49 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143026409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Mutual Supervision Framework for Referring Expression Segmentation and Generation
Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Liwei Wang
International Journal of Computer Vision, published 2025-01-23. DOI: https://doi.org/10.1007/s11263-024-02325-y

Abstract: Referring Expression Segmentation (RES) and Referring Expression Generation (REG) are mutually inverse tasks that can naturally be trained jointly. Although recent work has explored such joint training, the mechanism by which RES and REG benefit each other remains unclear. In this paper, we propose a mutual supervision framework that enables the two tasks to improve each other. Our mutual supervision works in two directions. On the one hand, Disambiguation Supervision leverages the expression unambiguity measurement provided by RES to enhance the language generation of REG. On the other hand, Generation Supervision uses expressions automatically generated by REG to scale up the training of RES. This mutual supervision effectively improves both tasks by addressing their respective bottlenecks. Extensive experiments show that our approach significantly outperforms all existing methods on the REG and RES tasks under the same setting, and detailed ablation studies demonstrate the effectiveness of every component in our framework.

{"title":"GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection","authors":"Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa","doi":"10.1007/s11263-025-02356-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02356-z","url":null,"abstract":"<p>Zero-shot OOD detection is a task that detects OOD images during inference with only in-distribution (ID) class names. Existing methods assume ID images contain a single, centered object, and do not consider the more realistic multi-object scenarios, where both ID and OOD objects are present. To meet the needs of many users, the detection method must have the flexibility to adapt the type of ID images. To this end, we present Global-Local Maximum Concept Matching (GL-MCM), which incorporates local image scores as an auxiliary score to enhance the separability of global and local visual features. Due to the simple ensemble score function design, GL-MCM can control the type of ID images with a single weight parameter. Experiments on ImageNet and multi-object benchmarks demonstrate that GL-MCM outperforms baseline zero-shot methods and is comparable to fully supervised methods. Furthermore, GL-MCM offers strong flexibility in adjusting the target type of ID images. The code is available via https://github.com/AtsuMiyai/GL-MCM.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"103 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pre-trained Trojan Attacks for Visual Recognition","authors":"Aishan Liu, Xianglong Liu, Xinwei Zhang, Yisong Xiao, Yuguang Zhou, Siyuan Liang, Jiakai Wang, Xiaochun Cao, Dacheng Tao","doi":"10.1007/s11263-024-02333-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02333-y","url":null,"abstract":"<p>Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the <i>Pre-trained Trojan</i> attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by <i>cross-task activation</i> and <i>shortcut connections</i> in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conducted extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our codes are available at https://github.com/Veee9/Pre-trained-Trojan.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Generalizability and Discriminability of Self-Supervised Learning from Evolutionary Game Theory Perspective","authors":"Jiangmeng Li, Zehua Zang, Qirui Ji, Chuxiong Sun, Wenwen Qiang, Junge Zhang, Changwen Zheng, Fuchun Sun, Hui Xiong","doi":"10.1007/s11263-024-02321-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02321-2","url":null,"abstract":"<p>Representations learned by self-supervised approaches are generally considered to possess sufficient <i>generalizability</i> and <i>discriminability</i>. However, we disclose a nontrivial <i>mutual-exclusion</i> relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or discriminability but not both simultaneously. Thus, learning representations <i>jointly</i> possessing strong generalizability and discriminability presents a specific challenge for self-supervised learning. To this end, we revisit the learning paradigm of self-supervised learning from the perspective of evolutionary game theory (EGT) and outline the theoretical roadmap to achieve a desired trade-off between these representation properties. EGT performs well in analyzing the trade-off point in a <i>two-player</i> game by utilizing dynamic system modeling. However, the EGT analysis requires sufficient annotated data, which contradicts the principle of self-supervised learning, i.e., the EGT analysis cannot be conducted without the annotations of the specific target domain for self-supervised learning. Thus, to enhance the methodological generalization, we propose a novel self-supervised learning method that leverages advancements in reinforcement learning to jointly benefit from the general guidance of EGT and sequentially optimize the model to chase the consistent improvement of generalizability and discriminability for specific target domains during pre-training. On top of this, we provide a benchmark to evaluate the generalizability and discriminability of learned representations comprehensively. Theoretically, we establish that the proposed method tightens the generalization error upper bound of self-supervised learning. Empirically, our method achieves state-of-the-art performance on various benchmarks. Our implementation is available at https://github.com/ZangZehua/essl.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142989088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation
Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
International Journal of Computer Vision, published 2025-01-15. DOI: https://doi.org/10.1007/s11263-024-02320-3

Abstract: Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and image data. A crucial element of our method is an object proposal module that examines the LiDAR point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals can be aligned with the corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we introduce a cross-modal distillation technique that utilizes image data partially annotated with the learnt pseudo-classes to train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements of our approach by extending the proposed model with teacher-student distillation using an exponential moving average and soft targets from the teacher. We show the generalization capability of our method by testing on four different datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. We present an in-depth experimental analysis of the proposed model, including results with another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and of the LiDAR density, supervised fine-tuning, as well as additional qualitative results and their analysis.

UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation
Jian Jin, Yang Shen, Xinyang Zhao, Zhenyong Fu, Jian Yang
International Journal of Computer Vision, published 2025-01-14. DOI: https://doi.org/10.1007/s11263-024-02334-x

Abstract: The demand for assorted conditional edits on a single real image is becoming increasingly prevalent. We focus on two dominant editing tasks that condition on image and text input respectively, namely subject-driven editing and semantic editing. Previous studies typically tackle these two editing tasks separately, thereby demanding multiple editing processes to achieve versatile edits on a single image. However, fragmented and sequential editing processes not only require more user effort but also further degrade the editing quality. In this paper, we propose UniCanvas, an affordance-aware unified framework that can achieve high-quality parallel subject-driven and semantic editing on a single real image within one inference process. UniCanvas unifies the multimodal inputs of the editing task into the textual condition space using tailored customization strategies. Building upon the unified representations, we propose a novel inference pipeline that performs parallel editing by selectively blending and manipulating two collaborative text-to-image generative branches. Customization enables the editing process to harness the strong visual understanding and reasoning capability of pre-trained generative models for affordance perception, and the unified inference space further facilitates more effective affordance interaction and alignment for compelling editing. Extensive experiments on diverse real images demonstrate that UniCanvas exhibits powerful scene affordance perception in unified image editing, achieving seamless subject-driven editing and precise semantic editing for various target subjects and query prompts (https://jinjianrick.github.io/unicanvas/).

{"title":"Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast","authors":"Kangcheng Liu, Chaoqun Wang, Xiaodong Han, Yong-Jin Liu, Baoquan Chen","doi":"10.1007/s11263-024-02340-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02340-z","url":null,"abstract":"<p>Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Our proposed approach enhances both the local coherence as well as the overall feature discrimination. Moreover, we have designed the linguistic foreground-aware regional point sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: (https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}