{"title":"Few-shot Object Detection via Improved Classification Features","authors":"Xinyu Jiang, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, D. Miao","doi":"10.1109/WACV56688.2023.00535","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00535","url":null,"abstract":"Few-shot object detection (FSOD) aims to transfer knowledge from base classes to novel classes, which receives widespread attention recently. The performance of current techniques is, however, limited by the poor classification ability and the improper features in the detection head. To circumvent this issue, we propose a Multi-level Feature Enhancement (MFE) model to improve the feature for classification from three different perspectives, including the spatial level, the task level and the regularization level. First, we revise the classifier’s input feature at the spatial level by using information from the regression head. Secondly, we separate the RoI-Align feature into two different feature distributions in order to improve features at the task level. Finally, taking into account the overfitting problem in FSOD, we design a simple but efficient regularization enhancement module to sample features into various distributions and enhance the regularization ability of classification. Extensive experiments show that our method achieves competitive results on PASCAL VOC datasets, and exceeds current state-of-the-art methods in all shot settings on challenging MS-COCO datasets.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"3 3-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114134023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Text-Guided Object Detector for Multi-modal Video Question Answering","authors":"Ruoyue Shen, Nakamasa Inoue, K. Shinoda","doi":"10.1109/WACV56688.2023.00109","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00109","url":null,"abstract":"Video Question Answering (Video QA) is a task to answer a text-format question based on the understanding of linguistic semantics, visual information, and also linguistic-visual alignment in the video. In Video QA, an object detector pre-trained with large-scale datasets, such as Faster R-CNN, has been widely used to extract visual representations from video frames. However, it is not always able to precisely detect the objects needed to answer the question be-cause of the domain gaps between the datasets for training the object detector and those for Video QA. In this paper, we propose a text-guided object detector (TGOD), which takes text question-answer pairs and video frames as inputs, detects the objects relevant to the given text, and thus provides intuitive visualization and interpretable results. Our experiments using the STAGE framework on the TVQA+ dataset show the effectiveness of our proposed detector. It achieves a 2.02 points improvement in accuracy of QA, 12.13 points improvement in object detection (mAP50), 1.1 points improvement in temporal location, and 2.52 points improvement in ASA over the STAGE original detector.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114272643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"More Knowledge, Less Bias: Unbiasing Scene Graph Generation with Explicit Ontological Adjustment","authors":"Zhanwen Chen, Saed Rezayi, Sheng Li","doi":"10.1109/WACV56688.2023.00401","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00401","url":null,"abstract":"Scene graph generation (SGG) models seek to detect relationships between objects in a given image. One challenge in this area is the biased distribution of predicates in the dataset and the semantic space. Recent works incorporating knowledge graphs with scene graphs prove effective in improving recall for the tail predicate classes. Moreover, many recent SGG approaches with promising results explicitly redistribute the predicates in both the training process and in the prediction step. To incorporate external knowledge, we construct a commonsense knowledge graph by integrating ConceptNet and Wikidata. To explicitly unbias SGG with knowledge in the reasoning process, we propose a novel framework, Explicit Ontological Adjustment (EOA), to adjust the graph model predictions with knowledge priors. We use the edge matrix from the commonsense knowledge graph as a module in the graph neural network model to refine the relationship detection process. This module proves effective in alleviating the long-tail distribution of predicates. When combined, we show that these modules achieve state-of-the-art performance on the Visual Genome dataset in most cases. The source code is available at https://github.com/zhanwenchen/eoa.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115160195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Backprop Induced Feature Weighting for Adversarial Domain Adaptation with Iterative Label Distribution Alignment","authors":"Thomas Westfechtel, Hao-Wei Yeh, Qier Meng, Yusuke Mukuta, Tatsuya Harada","doi":"10.1109/WACV56688.2023.00047","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00047","url":null,"abstract":"The requirement for large labeled datasets is one of the limiting factors for training accurate deep neural networks. Unsupervised domain adaptation tackles this problem of limited training data by transferring knowledge from one domain, which has many labeled data, to a different domain for which little to no labeled data is available. One common approach is to learn domain-invariant features for example with an adversarial approach. Previous methods often train the domain classifier and label classifier network separately, where both classification networks have little interaction with each other. In this paper, we introduce a classifier-based backprop-induced weighting of the feature space. This approach has two main advantages. Firstly, it lets the domain classifier focus on features that are important for the classification, and, secondly, it couples the classification and adversarial branch more closely. Furthermore, we introduce an iterative label distribution alignment method, that employs results of previous runs to approximate a class-balanced dataloader. We conduct experiments and ablation studies on three benchmarks Office-31, Office-Home, and DomainNet to show the effectiveness of our proposed algorithm.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114203832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-scale Cell-based Layout Representation for Document Understanding","authors":"Yuzhi Shi, Mijung Kim, Yeongnam Chae","doi":"10.1109/WACV56688.2023.00366","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00366","url":null,"abstract":"Deep learning techniques have achieved remarkable progress in document understanding. Most models use co-ordinates to represent absolute or relative spatial information of components, but they are difficult to represent latent rules in the document layout. This makes learning layout representation to be more difficult. Unlike the previous researches which have employed the coordinate system, graph or grid to represent the document layout, we propose a novel layout representation, the cell-based layout, to provide easy-to-understand spatial information for backbone models. In line with human reading habits, it uses cell information, i.e. row and column index, to represent the position of components in a document, and makes the document layout easier to understand. Furthermore, we proposed the multi-scale layout to represent the hierarchical structure of layout, and developed a data augmentation method to improve the performance. Experiment results show that our method achieves the state-of-the-art performance in text-based tasks, including form understanding and receipt understanding, and improves the performance in image-based task such as document image classification. We released the code in the repoa.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114869683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SketchInverter: Multi-Class Sketch-Based Image Generation via GAN Inversion","authors":"Zirui An, Jingbo Yu, Runtao Liu, Chuan Wang, Qian Yu","doi":"10.1109/WACV56688.2023.00430","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00430","url":null,"abstract":"This paper proposes the first GAN inversion-based method for multi-class sketch-based image generation (MCSBIG). MC-SBIG is a challenging task that requires strong prior knowledge due to the significant domain gap between sketches and natural images. Existing learning-based approaches rely on a large-scale paired dataset to learn the mapping between these two image modalities. However, since the public paired sketch-photo data are scarce, it is struggling for learning-based methods to achieve satisfactory results. In this work, we introduce a new approach based on GAN inversion, which can utilize a powerful pretrained generator to facilitate image generation from a given sketch. Our GAN inversion-based method has two advantages: 1. it can freely take advantage of the prior knowledge of a pretrained image generator; 2. it allows the proposed model to focus on learning the mapping from a sketch to a low-dimension latent code, which is a much easier task than directly mapping to a high-dimension natural image. We also present a novel shape loss to improve generation quality further. Extensive experiments are conducted to show that our method can produce sketch-faithful and photo-realistic images and significantly outperform the baseline methods.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127244423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings","authors":"Bhavin Jawade, D. Mohan, Naji Mohamed Ali, S. Setlur, Venugopal Govindaraju","doi":"10.1109/WACV56688.2023.00119","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00119","url":null,"abstract":"Cross-modal retrieval is a fundamental vision-language task with a broad range of practical applications. Text-to-image matching is the most common form of cross-modal retrieval where, given a large database of images and a textual query, the task is to retrieve the most relevant set of images. Existing methods utilize dual encoders with an attention mechanism and a ranking loss for learning embeddings that can be used for retrieval based on cosine similarity. Despite the fact that these methods attempt to perform semantic alignment across visual regions and textual words using tailored attention mechanisms, there is no explicit supervision from the training objective to enforce such alignment. To address this, we propose NAPReg, a novel regularization formulation that projects high-level semantic entities i.e Nouns into the embedding space as shared learnable proxies. We show that using such a formulation allows the attention mechanism to learn better word-region alignment while also utilizing region information from other samples to build a more generalized latent representation for semantic concepts. Experiments on three benchmark datasets i.e. MS-COCO, Flickr30k and Flickr8k demonstrate that our method achieves state-of-the-art results in cross-modal metric learning for text-image and image-text retrieval tasks. Code: https://github.com/bhavinjawade/NAPReq","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125418047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSFE-Net: Self-Supervised Feature Enhancement for Ultra-Fine-Grained Few-Shot Class Incremental Learning","authors":"Zicheng Pan, Xiaohan Yu, Miaohua Zhang, Yongsheng Gao","doi":"10.1109/WACV56688.2023.00621","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00621","url":null,"abstract":"Ultra-Fine-Grained Visual Categorization (ultra-FGVC) has become a popular problem due to its great real-world potential for classifying the same or closely related species with very similar layouts. However, there present many challenges for the existing ultra-FGVC methods, firstly there are always not enough samples in the existing ultraFGVC datasets based on which the models can easily get overfitting. Secondly, in practice, we are likely to find new species that we have not seen before and need to add them to existing models, which is known as incremental learning. The existing methods solve these problems by Few-Shot Class Incremental Learning (FSCIL), but the main challenge of the FSCIL models on ultra-FGVC tasks lies in their inferior discrimination detection ability since they usually use low-capacity networks to extract features, which leads to insufficient discriminative details extraction from ultrafine-grained images. In this paper, a self-supervised feature enhancement for the few-shot incremental learning network (SSFE-Net) is proposed to solve this problem. Specifically, a self-supervised learning (SSL) and knowledge distillation (KD) framework is developed to enhance the feature extraction of the low-capacity backbone network for ultra-FGVC few-shot class incremental learning tasks. Besides, we for the first time create a series of benchmarks for FSCIL tasks on two public ultra-FGVC datasets and three normal finegrained datasets, which will facilitate the development of the Ultra-FGVC community. Extensive experimental results on public ultra-FGVC datasets and other state-of-the-art benchmarks consistently demonstrate the effectiveness of the proposed method.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125644695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRN: Panoptic Refinement Network","authors":"Bo Sun, Jason Kuen, Zhe Lin, Philippos Mordohai, Simon Chen","doi":"10.1109/WACV56688.2023.00395","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00395","url":null,"abstract":"Panoptic segmentation is the task of uniquely assigning every pixel in an image to either a semantic label or an individual object instance, generating a coherent and complete scene description. Many current panoptic segmentation methods, however, predict masks of semantic classes and object instances in separate branches, yielding inconsistent predictions. Moreover, because state-of-the-art panoptic segmentation models rely on box proposals, the instance masks predicted are often of low-resolution. To overcome these limitations, we propose the Panoptic Refinement Network (PRN), which takes masks from base panoptic segmentation models and refines them jointly to produce coherent results. PRN extends the offset map-based architecture of Panoptic-Deeplab with several novel ideas including a foreground mask and instance bounding box offsets, as well as coordinate convolutions for improved spatial prediction. Experimental results on COCO and Cityscapes show that PRN can significantly improve already accurate results from a variety of panoptic segmentation networks.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125973764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GLAD: A Global-to-Local Anomaly Detector","authors":"Aitor Artola, Yannis Kolodziej, J. Morel, T. Ehret","doi":"10.1109/WACV56688.2023.00546","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00546","url":null,"abstract":"Learning to detect automatic anomalies in production plants remains a machine learning challenge. Since anomalies by definition cannot be learned, their detection must rely on a very accurate \"normality model\". To this aim, we introduce here a global-to-local Gaussian model for neural network features, learned from a set of normal images. This probabilistic model enables unsupervised anomaly detection. A global Gaussian mixture model of the features is first learned using all available features from normal data. This global Gaussian mixture model is then localized by an adaptation of the K-MLE algorithm, which learns a spatial weight map for each Gaussian. These weights are then used instead of the mixture weights to detect anomalies. This method enables precise modeling of complex data, even with limited data. Applied on WideResnet50-2 features, our approach outperforms the previous state of the art on the MVTec dataset, particularly on the object category. It is robust to perturbations that are frequent in production lines, such as imperfect alignment, and is on par in terms of memory and computation time with the previous state of the art.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126027598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}