{"title":"XDNet: A Few-Shot Meta-Learning Approach for Cross-Domain Visual Inspection","authors":"Xian Yeow Lee, L. Vidyaratne, M. Alam, Ahmed K. Farahat, Dipanjan Ghosh, Teresa Gonzalez Diaz, Chetan Gupta","doi":"10.1109/CVPRW59228.2023.00460","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00460","url":null,"abstract":"Automated visual inspection has the potential to improve the efficiency and accuracy of inspection tasks across various industries. Deep learning models have been at the forefront of many automated visual inspection technologies. In this work, we focus on a specific instance of a visual inspection problem: the defect detection and classification problem. Training a deep learning model from scratch to detect defects is challenging due to the scarcity of labeled images with defects. Moreover, it is progressively more challenging to adapt a deep learning model across different domains using limited labeled data. We propose a cross-domain meta-learning framework, XDNet, to solve the defect classification problem using a few labeled samples. XDNet is inspired by recent advancements in pre-trained backbone models as general feature extractors and meta-learning frameworks, which adapt across different domains using non-parametric classifiers under limited computational resources. We demonstrate the efficacy of XDNet using a benchmark anomaly detection dataset which we re-formulate as a defect detection and classification problem. Experimental results suggest that XDNet performs significantly better (≈ 17%) than the existing state-of-the-art and baseline models. Additionally, we perform an ablation study to identify the important components that contribute to the improved performance of the proposed framework. Finally, we conduct a data domain-specific analysis to understand the potential strengths and drawbacks of XDNet on different types of defects.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115661977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Category Differences Matter: A Broad Analysis of Inter-Category Error in Semantic Segmentation","authors":"Jingxing Zhou, Jürgen Beyerer","doi":"10.1109/CVPRW59228.2023.00401","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00401","url":null,"abstract":"In current evaluation schemes of semantic segmentation, metrics are calculated in such a way that all predicted classes should equally be identical to their ground truth, paying less attention to the various manifestations of the false predictions within the object category. In this work, we propose the Critical Error Rate (CER) as a supplement to the current evaluation metrics, focusing on the error rate, which reflects predictions that fall outside of the category from the ground truth. We conduct a series of experiments evaluating the behavior of different network architectures in various evaluation setups, including domain shift, the introduction of novel classes, and a mixture of these. We demonstrate the essential criteria for network generalization with those experiments. Furthermore, we ablate the impact of utilizing various class taxonomies for the evaluation of out-of-category error.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116660859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OTST: A Two-Phase Framework for Joint Denoising and Remosaicing in RGBW CFA","authors":"Zhihao Fan, Xun Wu, Fanqing Meng, Yaqi Wu, Feng Zhang","doi":"10.1109/CVPRW59228.2023.00284","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00284","url":null,"abstract":"RGBW, a newly emerged type of Color Filter Array (CFA), possesses strong low-light photography capabilities. RGBW CFA shows significant application value when low-light sensitivity is critical, such as in security cameras and smartphones. However, the majority of commercial image signal processors (ISP) are primarily designed for Bayer CFA, research pertaining to RGBW CFA is very rare. To address above limitations, in this study, we propose a two-phase framework named OTST for the RGBW Joint Denoising and Remosaicing (RGBW-JRD) task. For the denoising stage, we propose Omni-dimensional Dynamic Convolution based Half-Shuffle Transformer (ODC-HST) which can fully utilize image’s long-range dependencies to dynamically remove the noise. For the remosaicing stage, we propose a Spatial Compressive Transformer (SCT) to efficiently capture both local and global dependencies across spatial and channel dimensions. Experimental results demonstrate that our two-phase RGBW-JRD framework outperforms existing RGBW denoising and remosaicing solutions across a wide range of noise levels. In addition, the proposed approach ranks the 2nd place in MIPI 2023 RGBW Joint Remosaic and Denoise competition.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117174589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defending Low-Bandwidth Talking Head Videoconferencing Systems From Real-Time Puppeteering Attacks","authors":"Danial Samadi Vahdati, T. D. Nguyen, M. Stamm","doi":"10.1109/CVPRW59228.2023.00105","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00105","url":null,"abstract":"Talking head videos have gained significant attention in recent years due to advances in AI that allow for the synthesis of realistic videos from only a single image of the speaker. Recently, researchers have proposed low bandwidth talking head video systems for use in applications such as videoconferencing and video calls. However, these systems are vulnerable to puppeteering attacks, where an attacker can control a synthetic version of a different target speaker in real-time. This can be potentially used spread misinformation or committing fraud. Because the receiver always creates a synthetic video of the speaker, deepfake detectors cannot protect against these attacks. As a result, there are currently no defenses against puppeteering in these systems. In this paper, we propose a new defense against puppeteering attacks in low-bandwidth talking head video systems by utilizing the biometric information inherent in the facial expression and pose data transmitted to the receiver. Our proposed system requires no modifications to the video transmission system and operates with low computational cost. We present experimental evidence to demonstrate the effectiveness of our proposed defense and provide a new dataset for benchmarking defenses against puppeteering attacks.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121123125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Event-based Blur Kernel Estimation For Blind Motion Deblurring","authors":"Takuya Nakabayashi, Kunihiro Hasegawa, M. Matsugu, H. Saito","doi":"10.1109/CVPRW59228.2023.00433","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00433","url":null,"abstract":"Motion blur can significantly reduce the quality of images, and researchers have developed various algorithms to address this issue. One common approach to deblurring is to use deconvolution to cancel out the blur effect, but this method is limited by the difficulty of accurately estimating blur kernels from blurred images. This is because the motion causing the blur is often complex and nonlinear. In this paper, a new method for estimating blur kernels is proposed. This method uses an event camera, which captures high-temporal-resolution data on pixel luminance changes, along with a conventional camera to capture the input blurred image. By analyzing the event data stream, the proposed method estimates the 2D motion of the blurred image at short intervals during the exposure time, and integrates this information to estimate a variety of complex blur motions. With the estimated blur kernel, the input blurred image can be deblurred using deconvolution. The proposed method does not rely on machine learning and therefore can restore blurry images without depending on the quality and quantity of training data. Experimental results show that the proposed method can estimate blur kernels even for images blurred by complex camera motions, outperforming conventional methods. Overall, this paper presents a promising approach to motion deblurring that could have practical applications in a range of fields.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127454056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Video Frame Redundancies for Efficient Data Sampling and Annotation in Instance Segmentation","authors":"Jihun Yoon, Min-Kook Choi","doi":"10.1109/CVPRW59228.2023.00333","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00333","url":null,"abstract":"In recent years, deep neural network architectures and learning algorithms have greatly improved the performance of computer vision tasks. However, acquiring and annotating large-scale datasets for training such models can be expensive. In this work, we explore the potential of reducing dataset sizes by leveraging redundancies in video frames, specifically for instance segmentation. To accomplish this, we investigate two sampling strategies for extracting keyframes, uniform frame sampling with adjusted stride (UFS) and adaptive frame sampling (AFS), which employs visual (Optical flow, SSIM) or semantic (feature representations) dissimilarities measured by learning free methods. In addition, we show that a simple copy-paste augmentation can bridge the big mAP gap caused by frame reduction. We train and evaluate Mask R-CNN with the BDD100K MOTS dataset and verify the potential of reducing training data by extracting keyframes in the video. With only 20% of the data, we achieve similar performance to the full dataset mAP; with only 33% of the data, we surpass it. Lastly, based on our findings, we offer practical solutions for developing effective sampling methods and data annotation strategies for instance segmentation models. Supplementary on https://github.com/jihun-yoon/EVFR.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124864432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkiLL: Skipping Color and Label Landscape: Self Supervised Design Representations for Products in E-commerce","authors":"V. Verma, D. Sanny, S. Kulkarni, Prateek Sircar, Abhishek Singh, D. Gupta","doi":"10.1109/CVPRW59228.2023.00354","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00354","url":null,"abstract":"Understanding the design of a product without human supervision is a crucial task for e-commerce services. Such a capability can help in multiple downstream e-commerce tasks like product recommendations, design trend analysis, image-based search, and visual information retrieval, etc. For this task, getting fine-grain label data is costly and not scalable for the e-commerce product. In this paper, we leverage knowledge distillation based self-supervised learning (SSL) approach to learn design representations. These representations do not require human annotation for training and focus on only design related attributes of a product and ignore attributes like color, orientation, etc. We propose a global and task specific local augmentation space which captures the desired image information and provides robust visual embedding. We evaluated our model for the three highly diverse datasets, and also propose and measure a quantitative metric to evaluate the model’s color invariant feature learning ability. In all scenarios, our proposed approach outperforms the recent SSL model by upto 8.6% in terms of accuracy.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126019497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BlazeStyleGAN: A Real-Time On-Device StyleGAN","authors":"Haolin Jia, Qifei Wang, Omer Tov, Yang Zhao, Fei Deng, Lu Wang, Chuo-Ling Chang, Tingbo Hou, Matthias Grundmann","doi":"10.1109/CVPRW59228.2023.00495","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00495","url":null,"abstract":"StyleGAN models have been widely adopted for generating and editing face images. Yet, few work investigated running StyleGAN models on mobile devices. In this work, we introduce BlazeStyleGAN — to the best of our knowledge, the first StyleGAN model that can run in real-time on smartphones. We design an efficient synthesis network with the auxiliary head to convert features to RGB at each level of the generator, and only keep the last one at inference. We also improve the distillation strategy with a multi-scale perceptual loss using the auxiliary heads, and an adversarial loss for the student generator and discriminator. With these optimizations, BlazeStyleGAN can achieve real-time performance on high-end mobile GPUs. Experimental results demonstrate that BlazeStyleGAN generates high-quality face images and even mitigates some artifacts from the teacher model.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125405527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scoring Your Prediction on Unseen Data","authors":"Yuhao Chen, Shen Zhang, Renjie Song","doi":"10.1109/CVPRW59228.2023.00330","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00330","url":null,"abstract":"The performance of deep neural networks can vary substantially when evaluated on datasets different from the training data. This presents a crucial challenge in evaluating models on unseen data without access to labels. Previous methods compute a single model-based indicator at the dataset level and use regression methods to predict performance. To evaluate the model more accurately, we propose a sample-level label-free model evaluation method for better prediction on unseen data, named Scoring Your Prediction (SYP). Specifically, SYP introduces low-level image-based features (e.g., blurriness) to model image quality that is important for classification. We complementarily combine model-based indicators and image-based indicators to enhance sample representation. Additionally, we predict the probability that each sample is correctly classified using a neural network named oracle model. Compared to other existing methods, the proposed method outperforms them on 40 unlabeled datasets transformed by CIFAR-10. Especially, SYP lowers RMSE by 1.83-3.97 for ResNet-56 evaluation and 2.32-9.74 for RepVGG-A0 evaluation compared with latest methods. Note that our scheme won the championship on the DataCV Challenge at CVPR 2023. Source code is avaliabe at https://github.com/megvii-research/SYP.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126847428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fashion-Specific Ambiguous Expression Interpretation with Partial Visual-Semantic Embedding","authors":"Ryotaro Shimizu, Takuma Nakamura, M. Goto","doi":"10.1109/CVPRW59228.2023.00353","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00353","url":null,"abstract":"A novel technology named fashion intelligence system has been proposed to quantify ambiguous expressions unique to fashion, such as \"casual,\" \"adult-casual,\" and \"office-casual,\" and to support users’ understanding of fashion. However, the existing visual-semantic embedding (VSE) model, which is the basis of its system, does not support situations in which images are composed of multiple parts such as hair, tops, pants, skirts, and shoes. We propose partial VSE, which enables sensitive learning for each part of the fashion outfits. This enables five types of practical functionalities, particularly image-retrieval tasks in which changes are made only to the specified parts and image-reordering tasks that focus on the specified parts by the single model. Based on both the multiple unique qualitative and quantitative evaluation experiments, we show the effectiveness of the proposed model.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116038344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}