{"title":"A Mutual Supervision Framework for Referring Expression Segmentation and Generation","authors":"Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Liwei Wang","doi":"10.1007/s11263-024-02325-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02325-y","url":null,"abstract":"<p>Reference Expression Segmentation (RES) and Reference Expression Generation (REG) are mutually inverse tasks that can be naturally jointly trained. Though recent work has explored such joint training, the mechanism of how RES and REG can benefit each other is still unclear. In this paper, we propose a mutual supervision framework that enables two tasks to improve each other. Our mutual supervision contains two directions. On the one hand, Disambiguation Supervision leverages the expression unambiguity measurement provided by RES to enhance the language generation of REG. On the other hand, Generation Supervision uses expressions automatically generated by REG to scale up the training of RES. Such mutual supervision effectively improves two tasks by solving their bottleneck problems. Extensive experiments show that our approach significantly outperforms all existing methods on REG and RES tasks under the same setting, and detailed ablation studies demonstrate the effectiveness of all components in our framework.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"11 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143020694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection","authors":"Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa","doi":"10.1007/s11263-025-02356-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02356-z","url":null,"abstract":"<p>Zero-shot OOD detection is a task that detects OOD images during inference with only in-distribution (ID) class names. Existing methods assume ID images contain a single, centered object, and do not consider the more realistic multi-object scenarios, where both ID and OOD objects are present. To meet the needs of many users, the detection method must have the flexibility to adapt the type of ID images. To this end, we present Global-Local Maximum Concept Matching (GL-MCM), which incorporates local image scores as an auxiliary score to enhance the separability of global and local visual features. Due to the simple ensemble score function design, GL-MCM can control the type of ID images with a single weight parameter. Experiments on ImageNet and multi-object benchmarks demonstrate that GL-MCM outperforms baseline zero-shot methods and is comparable to fully supervised methods. Furthermore, GL-MCM offers strong flexibility in adjusting the target type of ID images. The code is available via https://github.com/AtsuMiyai/GL-MCM.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"103 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pre-trained Trojan Attacks for Visual Recognition","authors":"Aishan Liu, Xianglong Liu, Xinwei Zhang, Yisong Xiao, Yuguang Zhou, Siyuan Liang, Jiakai Wang, Xiaochun Cao, Dacheng Tao","doi":"10.1007/s11263-024-02333-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02333-y","url":null,"abstract":"<p>Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation. In this paper, we propose the <i>Pre-trained Trojan</i> attack, which embeds backdoors into a PVM, enabling attacks across various downstream vision tasks. We highlight the challenges posed by <i>cross-task activation</i> and <i>shortcut connections</i> in successful backdoor attacks. To achieve effective trigger activation in diverse tasks, we stylize the backdoor trigger patterns with class-specific textures, enhancing the recognition of task-irrelevant low-level features associated with the target class in the trigger pattern. Moreover, we address the issue of shortcut connections by introducing a context-free learning pipeline for poison training. In this approach, triggers without contextual backgrounds are directly utilized as training data, diverging from the conventional use of clean images. Consequently, we establish a direct shortcut from the trigger to the target class, mitigating the shortcut connection issue. We conducted extensive experiments to thoroughly validate the effectiveness of our attacks on downstream detection and segmentation tasks. Additionally, we showcase the potential of our approach in more practical scenarios, including large vision models and 3D object detection in autonomous driving. This paper aims to raise awareness of the potential threats associated with applying PVMs in practical scenarios. Our codes are available at https://github.com/Veee9/Pre-trained-Trojan.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Generalizability and Discriminability of Self-Supervised Learning from Evolutionary Game Theory Perspective","authors":"Jiangmeng Li, Zehua Zang, Qirui Ji, Chuxiong Sun, Wenwen Qiang, Junge Zhang, Changwen Zheng, Fuchun Sun, Hui Xiong","doi":"10.1007/s11263-024-02321-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02321-2","url":null,"abstract":"<p>Representations learned by self-supervised approaches are generally considered to possess sufficient <i>generalizability</i> and <i>discriminability</i>. However, we disclose a nontrivial <i>mutual-exclusion</i> relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or discriminability but not both simultaneously. Thus, learning representations <i>jointly</i> possessing strong generalizability and discriminability presents a specific challenge for self-supervised learning. To this end, we revisit the learning paradigm of self-supervised learning from the perspective of evolutionary game theory (EGT) and outline the theoretical roadmap to achieve a desired trade-off between these representation properties. EGT performs well in analyzing the trade-off point in a <i>two-player</i> game by utilizing dynamic system modeling. However, the EGT analysis requires sufficient annotated data, which contradicts the principle of self-supervised learning, i.e., the EGT analysis cannot be conducted without the annotations of the specific target domain for self-supervised learning. Thus, to enhance the methodological generalization, we propose a novel self-supervised learning method that leverages advancements in reinforcement learning to jointly benefit from the general guidance of EGT and sequentially optimize the model to chase the consistent improvement of generalizability and discriminability for specific target domains during pre-training. On top of this, we provide a benchmark to evaluate the generalizability and discriminability of learned representations comprehensively. Theoretically, we establish that the proposed method tightens the generalization error upper bound of self-supervised learning. Empirically, our method achieves state-of-the-art performance on various benchmarks. Our implementation is available at https://github.com/ZangZehua/essl.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"28 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142989088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation","authors":"Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic","doi":"10.1007/s11263-024-02320-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02320-3","url":null,"abstract":"<p>Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the need for manual labeling. Our contributions are as follows. First, we develop a novel approach for cross-modal unsupervised learning of semantic segmentation by leveraging synchronized LiDAR and image data. A crucial element of our method is the integration of an object proposal module that examines the LiDAR point cloud to generate proposals for spatially consistent objects. Second, we demonstrate that these 3D object proposals can be aligned with corresponding images and effectively grouped into semantically meaningful pseudo-classes. Third, we introduce a cross-modal distillation technique that utilizes image data partially annotated with the learnt pseudo-classes to train a transformer-based model for semantic image segmentation. Fourth, we demonstrate further significant improvements of our approach by extending the proposed model using a teacher-student distillation with an exponential moving average and incorporating soft targets from the teacher. We show the generalization capabilities of our method by testing on four different testing datasets (Cityscapes, Dark Zurich, Nighttime Driving, and ACDC) without any fine-tuning. We present an in-depth experimental analysis of the proposed model including results when using another pre-training dataset, per-class and pixel accuracy results, confusion matrices, PCA visualization, k-NN evaluation, ablations of the number of clusters and LiDAR’s density, supervised finetuning as well as additional qualitative results and their analysis.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"42 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142986393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation","authors":"Jian Jin, Yang Shen, Xinyang Zhao, Zhenyong Fu, Jian Yang","doi":"10.1007/s11263-024-02334-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02334-x","url":null,"abstract":"<p>The demand for assorted conditional edits on a single real image is becoming increasingly prevalent. We focus on two dominant editing tasks that respectively condition on image and text input, namely subject-driven editing and semantic editing. Previous studies typically tackle these two editing tasks separately, thereby demanding multiple editing processes to achieve versatile edits on a single image. However, fragmented and sequential editing processes not only require more user effort but also further degrade the editing quality. In this paper, we propose <span>UniCanvas</span>, an affordance-aware unified framework that can achieve high-quality parallel subject-driven and semantic editing on a single real image within one inference process. <span>UniCanvas</span> innovatively unifies the multimodal inputs of the editing task into the textual condition space using tailored customization strategies. Building upon the unified representations, we propose a novel inference pipeline that performs parallel editing by selectively blending and manipulating two collaborative text-to-image generative branches. Customization enables the editing process to harness the strong visual understanding and reasoning capability of pre-trained generative models for affordance perception, and a unified inference space further facilitates more effective affordance interaction and alignment for compelling editing. Extensive experiments on diverse real images demonstrate that <span>UniCanvas</span> exhibits powerful scene affordance perception in unified image editing, achieving seamless subject-driven editing and precise semantic editing for various target subjects and query prompts (https://jinjianrick.github.io/unicanvas/).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"59 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast","authors":"Kangcheng Liu, Chaoqun Wang, Xiaodong Han, Yong-Jin Liu, Baoquan Chen","doi":"10.1007/s11263-024-02340-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02340-z","url":null,"abstract":"<p>Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast FAC++ framework to learn more effective point cloud representations in pre-training. FAC++ consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage grouped foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Our proposed approach enhances both the local coherence as well as the overall feature discrimination. Moreover, we have designed the linguistic foreground-aware regional point sampling to enhance more balanced foreground-aware learning, which is termed FAC++. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC++ achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation, instance segmentation as well as object detection tasks. All codes, data, and models are available at: (https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142974530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RIGID: Recurrent GAN Inversion and Editing of Real Face Videos and Beyond","authors":"Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo","doi":"10.1007/s11263-024-02329-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02329-8","url":null,"abstract":"<p>GAN inversion is essential for harnessing the editability of GANs in real images, yet existing methods that invert video frames individually often yield temporally inconsistent results. To address this issue, we present a unified recurrent framework, <b>R</b>ecurrent v<b>I</b>deo <b>G</b>AN <b>I</b>nversion and e<b>D</b>iting (RIGID), designed to enforce temporally coherent GAN inversion and facial editing in real videos explicitly and simultaneously. Our approach models temporal relations between current and previous frames in three ways: (1) by maximizing inversion fidelity and consistency through learning a temporally compensated latent code and spatial features, (2) by disentangling high-frequency incoherent noises from the latent space, and (3) by introducing an in-between frame composition constraint to eliminate inconsistency after attribute manipulation, ensuring that each frame is a direct composite of its neighbors. Compared to existing video- and attribute-specific works, RIGID eliminates the need for expensive re-training of the model, resulting in approximately 60<span>(times )</span> faster performance. Furthermore, RIGID can be easily extended to other face domains, showcasing its versatility and adaptability. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods in inversion and editing tasks both qualitatively and quantitatively.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"121 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142967874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Meshing from Delaunay Triangulation for 3D Shape Representation","authors":"Chen Zhang, Wenbing Tao","doi":"10.1007/s11263-024-02344-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02344-9","url":null,"abstract":"<p>Recently, there has been a growing interest in learning-based explicit methods due to their ability to respect the original input and preserve details. However, the connectivity on complex structures is still difficult to infer due to the limited local shape perception, resulting in artifacts and non-watertight triangles. In this paper, we present a novel learning-based method with Delaunay triangulation to achieve high-precision reconstruction. We model the Delaunay triangulation as a dual graph, extract multi-scale geometric information from the points, and embed it into the structural representation of Delaunay triangulation in an organic way, benefiting fine-grained details reconstruction. To encourage neighborhood information interaction of edges and nodes in the graph, we introduce a Local Graph Iteration algorithm, serving as a variant of graph neural network. Benefiting from its robust local processing for dual graph, a scaling strategy is designed to enable large-scale reconstruction. Moreover, due to the complicated spatial relations between tetrahedrons and the ground truth surface, it is hard to directly generate ground truth labels of tetrahedrons for supervision. Therefore, we propose a multi-label supervision strategy, which is integrated in the loss we design for this task and allows our method to obtain robust labeling without visibility information. Experiments show that our method yields watertight and high-quality meshes. Especially for some thin structures and sharp edges, our method shows better performance than the current state-of-the-art methods. Furthermore, it has a strong adaptability to point clouds of different densities.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142939983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models","authors":"Angus Fung, Beno Benhabib, Goldie Nejat","doi":"10.1007/s11263-024-02336-9","DOIUrl":"https://doi.org/10.1007/s11263-024-02336-9","url":null,"abstract":"<p>Tracking of dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to the presence of intraclass variations including occlusions, pose deformations, and lighting variations. This paper introduces a novel deep learning architecture, using conditional latent diffusion models, the Latent Diffusion Track (<i>LDTrack</i>), for tracking multiple dynamic people under intraclass variations. By uniquely utilizing conditional latent diffusion models to capture temporal person embeddings, our architecture can adapt to appearance changes of people over time. We incorporated a latent feature encoder network which enables the diffusion process to operate within a high-dimensional latent space to allow for the extraction and spatial–temporal refinement of such rich features as person appearance, motion, location, identity, and contextual information. Extensive experiments demonstrate the effectiveness of <i>LDTrack</i> over other state-of-the-art tracking methods in cluttered and crowded human-centered environments under intraclass variations. Namely, the results show our method outperforms existing deep learning robotic people tracking methods in both tracking accuracy and tracking precision with statistical significance. Additionally, a comprehensive multi-object tracking comparison study was performed against the state-of-the-art methods in urban environments, demonstrating the generalizability of <i>LDTrack</i>. An ablation study was performed to validate the design choices of <i>LDTrack</i>.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}