{"title":"One-Shot Generative Domain Adaptation in 3D GANs","authors":"Ziqiang Li, Yi Wu, Chaoyue Wang, Xue Rui, Bin Li","doi":"10.1007/s11263-024-02268-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02268-4","url":null,"abstract":"<p>3D-aware image generation necessitates extensive training data to ensure stable training and mitigate the risk of overfitting. This paper first consider a novel task known as One-shot 3D Generative Domain Adaptation (GDA), aimed at transferring a pre-trained 3D generator from one domain to a new one, relying solely on a single reference image. One-shot 3D GDA is characterized by the pursuit of specific attributes, namely, <i>high fidelity</i>, <i>large diversity</i>, <i>cross-domain consistency</i>, and <i>multi-view consistency</i>. Within this paper, we introduce 3D-Adapter, the first one-shot 3D GDA method, for diverse and faithful generation. Our approach begins by judiciously selecting a restricted weight set for fine-tuning, and subsequently leverages four advanced loss functions to facilitate adaptation. An efficient progressive fine-tuning strategy is also implemented to enhance the adaptation process. The synergy of these three technological components empowers 3D-Adapter to achieve remarkable performance, substantiated both quantitatively and qualitatively, across all desired properties of 3D GDA. Furthermore, 3D-Adapter seamlessly extends its capabilities to zero-shot scenarios, and preserves the potential for crucial tasks such as interpolation, reconstruction, and editing within the latent space of the pre-trained generator. Code will be available at https://github.com/iceli1007/3D-Adapter.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"61 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142684360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NAFT and SynthStab: A RAFT-Based Network and a Synthetic Dataset for Digital Video Stabilization","authors":"Marcos Roberto e Souza, Helena de Almeida Maia, Helio Pedrini","doi":"10.1007/s11263-024-02264-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02264-8","url":null,"abstract":"<p>Multiple deep learning-based stabilization methods have been proposed recently. Some of them directly predict the optical flow to warp each unstable frame into its stabilized version, which we called direct warping. These methods primarily perform online or semi-online stabilization, prioritizing lower computational cost while achieving satisfactory results in certain scenarios. However, they fail to smooth intense instabilities and have considerably inferior results in comparison to other approaches. To improve their quality and reduce this difference, we propose: (a) NAFT, a new direct warping semi-online stabilization method, which adapts RAFT to videos by including a neighborhood-aware update mechanism, called IUNO. By using our training approach along with IUNO, we can learn the characteristics that contribute to video stability from the data patterns, rather than requiring an explicit stability definition. Furthermore, we demonstrate how leveraging an off-the-shelf video inpainting method to achieve full-frame stabilization; (b) SynthStab, a new synthetic dataset consisting of paired videos that allows supervision by camera motion instead of pixel similarities. To build SynthStab, we modeled camera motion using kinematic concepts. In addition, the unstable motion respects scene constraints, such as depth variation. We performed several experiments on SynthStab to develop and validate NAFT. We compared our results with five other methods from the literature with publicly available code. Our experimental results show that we were able to stabilize intense camera motion, outperforming other direct warping methods and bringing its performance closer to state-of-the-art methods. In terms of computational resources, our smallest network has only about 7% of model size and trainable parameters than the smallest values among the competing methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142690533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CS-CoLBP: Cross-Scale Co-occurrence Local Binary Pattern for Image Classification","authors":"Bin Xiao, Danyu Shi, Xiuli Bi, Weisheng Li, Xinbo Gao","doi":"10.1007/s11263-024-02297-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02297-z","url":null,"abstract":"<p>The local binary pattern (LBP) is an effective feature, describing the size relationship between the neighboring pixels and the current pixel. While individual LBP-based methods yield good results, co-occurrence LBP-based methods exhibit a better ability to extract structural information. However, most of the co-occurrence LBP-based methods excel mainly in dealing with rotated images, exhibiting limitations in preserving performance for scaled images. To address the issue, a cross-scale co-occurrence LBP (CS-CoLBP) is proposed. Initially, we construct an LBP co-occurrence space to capture robust structural features by simulating scale transformation. Subsequently, we use Cross-Scale Co-occurrence pairs (CS-Co pairs) to extract the structural features, keeping robust descriptions even in the presence of scaling. Finally, we refine these CS-Co pairs through Rotation Consistency Adjustment (RCA) to bolster their rotation invariance, thereby making the proposed CS-CoLBP as powerful as existing co-occurrence LBP-based methods for rotated image description. While keeping the desired geometric invariance, the proposed CS-CoLBP maintains a modest feature dimension. Empirical evaluations across several datasets demonstrate that CS-CoLBP outperforms the existing state-of-the-art LBP-based methods even in the presence of geometric transformations and image manipulations.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"53 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142673313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Warping the Residuals for Image Editing with StyleGAN","authors":"Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar","doi":"10.1007/s11263-024-02301-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02301-6","url":null,"abstract":"<p>StyleGAN models show editing capabilities via their semantically interpretable latent organizations which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN’s latent space. However, their results either suffer from low fidelity to the input image or poor editing qualities, especially for edits that require large transformations. That is because low bit rate latent spaces lose many image details due to the information bottleneck even though it provides an editable space. On the other hand, higher bit rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing qualities. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. This is because edits often involve spatial changes in the image, such as adjustments to pose or smile. Thus, high-rate latent features must be accurately repositioned to match their new locations in the edited image space. We achieve this by employing flow estimation to determine the necessary spatial adjustments, followed by warping the features to align them correctly in the edited image. Specifically, we estimate the flows from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high-fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142670356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation","authors":"Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Liwei Wu, Yuxi Wang, Zhaoxiang Zhang","doi":"10.1007/s11263-024-02285-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02285-3","url":null,"abstract":"<p>Domain-adaptive semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, existing methods primarily focus on directly learning categorically discriminative target features for segmenting target images, which is challenging in the absence of target labels. This work provides a new perspective. We ob serve that the features learned with source data manage to keep categorically discriminative during training, thereby enabling us to implicitly learn adequate target representations by simply <i>pulling target features close to source features for each category</i>. To this end, we propose T2S-DA, which encourages the model to learn similar cross-domain features. Also, considering the pixel categories are heavily imbalanced for segmentation datasets, we come up with a dynamic re-weighting strategy to help the model concentrate on those underperforming classes. Extensive experiments confirm that T2S-DA learns a more discriminative and generalizable representation, significantly surpassing the state-of-the-art. We further show that T2S-DA is quite qualified for the domain generalization task, verifying its domain-invariant property.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"99 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142642626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Matching via Graph Clustering with Local Affine Consensus","authors":"Yifan Lu, Jiayi Ma","doi":"10.1007/s11263-024-02291-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02291-5","url":null,"abstract":"<p>This paper studies graph clustering with application to feature matching and proposes an effective method, termed as GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometric meaningful graph, based on the best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph by replicator dynamics optimization. Extensive experiments focusing on both the local and the whole of our GC-LAC with various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multimodel fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"75 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to Detect Novel Species with SAM in the Wild","authors":"Garvita Allabadi, Ana Lucic, Yu-Xiong Wang, Vikram Adve","doi":"10.1007/s11263-024-02234-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02234-0","url":null,"abstract":"<p>This paper tackles the limitation of a closed-world object detection model that was trained on one species. The expectation for this model is that it will not generalize well to recognize the instances of new species if they were present in the incoming data stream. We propose a novel object detection framework for this open-world setting that is suitable for applications that monitor wildlife, ocean life, livestock, plant phenotype and crops that typically feature one species in the image. Our method leverages labeled samples from one species in combination with a novelty detection method and Segment Anything Model, a vision foundation model, to (1) identify the presence of new species in unlabeled images, (2) localize their instances, and (3) <i>retrain</i> the initial model with the localized novel class instances. The resulting integrated system <i>assimilates</i> and <i>learns</i> from unlabeled samples of the new classes while not “forgetting” the original species the model was trained on. We demonstrate our findings on two different domains, (1) wildlife detection and (2) plant detection. Our method achieves an AP of 56.2 (for 4 novel species) to 61.6 (for 1 novel species) for wildlife domain, without relying on any ground truth data in the background.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"80 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142610210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MVTN: Learning Multi-view Transformations for 3D Understanding","authors":"Abdullah Hamdi, Faisal AlZahrani, Silvio Giancola, Bernard Ghanem","doi":"10.1007/s11263-024-02283-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02283-5","url":null,"abstract":"<p>Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"38 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification","authors":"Yukang Zhang, Yan Yan, Yang Lu, Hanzi Wang","doi":"10.1007/s11263-024-02276-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02276-4","url":null,"abstract":"<p>Visible-infrared person re-identification (VIReID) has attracted increasing attention due to the requirements for 24-hour intelligent surveillance systems. In this task, one of the major challenges is the modality discrepancy between the visible (VIS) and infrared (NIR) images. Most conventional methods try to design complex networks or generative models to mitigate the cross-modality discrepancy while ignoring the fact that the modality gaps differ between the different VIS and NIR images. Different from existing methods, in this paper, we propose an Adaptive Middle-modality Alignment Learning (AMML) method, which can effectively reduce the modality discrepancy via an adaptive middle modality learning strategy at both image level and feature level. The proposed AMML method enjoys several merits. First, we propose an Adaptive Middle-modality Generator (AMG) module to reduce the modality discrepancy between the VIS and NIR images from the image level, which can effectively project the VIS and NIR images into a unified middle modality image (UMMI) space to adaptively generate middle-modality (M-modality) images. Second, we propose a feature-level Adaptive Distribution Alignment (ADA) loss to force the distribution of the VIS features and NIR features adaptively align with the distribution of M-modality features. Moreover, we also propose a novel Center-based Diverse Distribution Learning (CDDL) loss, which can effectively learn diverse cross-modality knowledge from different modalities while reducing the modality discrepancy between the VIS and NIR modalities. Extensive experiments on three challenging VIReID datasets show the superiority of the proposed AMML method over the other state-of-the-art methods. More remarkably, our method achieves 77.8% in terms of Rank-1 and 74.8% in terms of mAP on the SYSU-MM01 dataset for all search mode, and 86.6% in terms of Rank-1 and 88.3% in terms of mAP on the SYSU-MM01 dataset for indoor search mode. The code is released at: https://github.com/ZYK100/MMN.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142597431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Contemporary Deep Learning Techniques for Error Correction in Biometric Data","authors":"YenLung Lai, XingBo Dong, Zhe Jin, Wei Jia, Massimo Tistarelli, XueJun Li","doi":"10.1007/s11263-024-02280-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02280-8","url":null,"abstract":"<p>In the realm of cryptography, the implementation of error correction in biometric data offers many benefits, including secure data storage and key derivation. Deep learning-based decoders have emerged as a catalyst for improved error correction when decoding noisy biometric data. Although these decoders exhibit competence in approximating precise solutions, we expose the potential inadequacy of their security assurances through a minimum entropy analysis. This limitation curtails their applicability in secure biometric contexts, as the inherent complexities of their non-linear neural network architectures pose challenges in modeling the solution distribution precisely. To address this limitation, we introduce U-Sketch, a universal approach for error correction in biometrics, which converts arbitrary input random biometric source distributions into independent and identically distributed (i.i.d.) data while maintaining the pairwise distance of the data post-transformation. This method ensures interpretability within the decoder, facilitating transparent entropy analysis and a substantiated security claim. Moreover, U-Sketch employs Maximum Likelihood Decoding, which provides optimal error tolerance and a precise security guarantee.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"48 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142588713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}