{"title":"Multi-modal Emotion Reaction Intensity Estimation with Temporal Augmentation","authors":"Feng Qiu, Bowen Ma, Wei Zhang, Yu-qiong Ding","doi":"10.1109/CVPRW59228.2023.00613","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00613","url":null,"abstract":"Emotion reaction intensity (ERI) estimation aims to estimate the emotion intensities of subjects reacting to various video-based stimuli. It plays an important role in human affective behavior analysis. In this paper, we proposed a effective solution for addressing the task of ERI estimation in the fifth Affective Behavior Analysis in the wild (ABAW) competition. Based on multi-modal information, We first extract uni-modal features from images, speeches and texts, respectively and then regress the intensities of 7 emotions. To enhance the model generalization and capture context information, we employ the Temporal Augmentation module to adapt to various video samples and the Temporal SE Block to reweight temporal features adaptively. The extensive experiments conducted on large-scale dataset, Hume-Reaction, demonstrate the effectiveness of our approach. Our method achieves average pearson’s correlations coefficient of 0.4160 on the validation set and obtain third place in the ERI Estimation Challenge of ABAW 2023.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131279525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAVLI - Using image associations to produce local concept-based explanations","authors":"Pushkar Shukla, Sushil Bharati, Matthew A. Turk","doi":"10.1109/CVPRW59228.2023.00387","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00387","url":null,"abstract":"While explainability is becoming increasingly crucial in computer vision and machine learning, producing explanations that can link decisions made by deep neural networks to concepts that are easily understood by humans still remains a challenge. To address this challenge, we propose a framework that produces local concept-based explanations for the classification decisions made by a deep neural network. Our framework is based on the intuition that if there is a high overlap between the regions of the image that are associated with a human-defined concept and regions of the image that are useful for decision-making, then the decision is highly dependent on the concept. Our proposed CAVLI framework combines a global approach (TCAV) with a local approach (LIME). To test the effectiveness of the approach, we conducted experiments on both the ImageNet and CelebA datasets. These experiments validate the ability of our framework to quantify the dependence of individual decisions on predefined concepts. By providing local concept-based explanations, our framework has the potential to improve the transparency and interpretability of deep neural networks in a variety of applications.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"20 13","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131505717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Three-Stage Framework with Reliable Sample Pool for Long-Tailed Classification","authors":"Feng Cai, Keyu Wu, Haipeng Wang, Feng Wang","doi":"10.1109/CVPRW59228.2023.00054","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00054","url":null,"abstract":"Synthetic Aperture Radar (SAR) imagery presents a promising solution for acquiring Earth surface information regardless of weather and daylight. However, the SAR dataset is commonly characterized by a long-tailed distribution due to the scarcity of samples from infrequent categories. In this work, we extend the problem to aerial view object classification in the SAR dataset with long-tailed distribution and a plethora of negative samples. Specifically, we propose a three-stage approach that employs a ResNet101 backbone for feature extraction, Class-balanced Focal Loss for class-level re-weighting, and reliable pseudo-labels generated through semi-supervised learning to improve model performance. Moreover, we introduce a Reliable Sample Pool (RSP) to enhance the model's confidence in predicting in-distribution data and mitigate the domain gap between the labeled and unlabeled sets. The proposed framework achieved a Top-1 Accuracy of 63.20% and an AUROC of 0.71 on the final dataset, winning the first place in track 1 of the PBVS 2023 Multi-modal Aerial View Object Classification Challenge.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131555834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ultra-Sonic Sensor based Object Detection for Autonomous Vehicles","authors":"T. Nesti, Santhosh Boddana, Burhaneddin Yaman","doi":"10.1109/CVPRW59228.2023.00026","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00026","url":null,"abstract":"Perception systems in autonomous vehicles (AV) have made significant advancements in recent years. Such systems leverage different sensing modalities such as cameras, LiDARs and Radars, and are powered by state-of-the-art deep learning algorithms. Ultrasonic sensors (USS) are a low-cost, durable and robust sensing technology that is particularly suitable for near-range detection in harsh weather conditions, but have received very limited attention in the perception literature. In this work, we present a novel USS-based object detection system that can enable accurate detection of objects in low-speed scenarios. The proposed pipeline involves four steps. First, the input USS data is transformed into a novel voxelized 3D point cloud leveraging the physics of USS. Next, multi-channels Bird Eye’s View (BEV) images are generated via projection operators. Later, the resolution of BEV images is enhanced by means of a rolling-window, vehicle movement-aware temporal aggregation process. Finally, the image-like data representation is used to train a deep neural network to detect and localize objects in the 2D plane. We present extensive experiments showing that the proposed framework achieves satisfactory performance across both classic and custom object detection metrics, thus bridging the usecase and literature visibility gap between USS and more established sensors.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115169688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving language-supervised object detection with linguistic structure analysis","authors":"Arushi Rai, Adriana Kovashka","doi":"10.1109/CVPRW59228.2023.00588","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00588","url":null,"abstract":"Language-supervised object detection typically uses descriptive captions from human-annotated datasets. However, in-the-wild captions take on wider styles of language. We analyze one particular ubiquitous form of language: narrative. We study the differences in linguistic structure and visual-text alignment in narrative and descriptive captions and find we can classify descriptive and narrative style captions using linguistic features such as part of speech, rhetoric structure theory, and multimodal discourse. Then, we use this to select captions from which to extract image-level labels as supervision for weakly supervised object detection. We also improve the quality of extracted labels by filtering based on proximity to verb types for both descriptive and narrative captions.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121266402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scene Graph Driven Text-Prompt Generation for Image Inpainting","authors":"Tripti Shukla, Paridhi Maheshwari, Rajhans Singh, Ankit Shukla, K. Kulkarni, P. Turaga","doi":"10.1109/CVPRW59228.2023.00083","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00083","url":null,"abstract":"Scene editing methods are undergoing a revolution, driven by text-to-image synthesis methods. Applications in media content generation have benefited from a careful set of engineered text prompts, that have been arrived at by the artists by trial and error. There is a growing need to better model prompt generation, for it to be useful for a broad range of consumer-grade applications. We propose a novel method for text prompt generation for the explicit purpose of consumer-grade image inpainting, i.e. insertion of new objects into missing regions in an image. Our approach leverages existing inter-object relationships to generate plausible textual descriptions for the missing object, that can then be used with any text-to-image generator. Given an image and a location where a new object is to be inserted, our approach first converts the given image to an intermediate scene graph. Then, we use graph convolutional networks to ‘expand’ the scene graph by predicting the identity and relationships of the new object to be inserted, with respect to the existing objects in the scene. The output of the expanded scene graph is cast into a textual description, which is then processed by a text-to-image generator, conditioned on the given image, to produce the final inpainted image. We conduct extensive experiments on the Visual Genome dataset, and show through qualitative and quantitative metrics that our method is superior to other methods.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"2014 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128105477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prioritised Moderation for Online Advertising","authors":"Phanideep Gampa, Akash Anil Valsangkar, Shailesh Choubey, Pooja A","doi":"10.1109/CVPRW59228.2023.00194","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00194","url":null,"abstract":"Online advertisement industry aims to build a preference for a product over its competitors by making consumers aware of the product at internet scale. However, the ads that violate the applicable laws and location specific regulations can have serious business impact with legal implications. At the same time, customers are at risk of getting exposed to egregious ads resulting in a bad user experience. Due to the limited and costly human bandwidth, moderating ads at the industry scale is a challenging task. Typically at Amazon Advertising, we deal with ad moderation workflows where the ad distributions are skewed by non defective ads. It is desirable to increase the review time that the human moderators spend on moderating genuine defective ads. Hence prioritisation of deemed defective ads for human moderation is crucial for the effective utilisation of human bandwidth in the ad moderation workflow. To incorporate the business knowledge and to better deal with the possible overlaps between the policies, we formulate this as a policy gradient ranking algorithm with custom scalar rewards. Our extensive experiments demonstrate that these techniques show a substantial gain in number of defective ads caught against various tabular classification algorithms, resulting in effective utilisation of human moderation bandwidth.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128145271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Partial Fingerprint Recognition","authors":"Yufei Zhang, Rui Zhao, Ziyi Zhao, Naveen Ramakrishnan, Manoj Aggarwal, G. Medioni, Q. Ji","doi":"10.1109/CVPRW59228.2023.00108","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00108","url":null,"abstract":"Low quality capture and obstruction on fingers often result in partially visible fingerprint images, which imposes challenge for fingerprint recognition. In this work, motivated from the practical use cases, we first systematically studied different types of partial occlusion. Specifically, two major types of partial occlusion, including six granular types, and the corresponding methods to simulate each type for model evaluation and improvement were introduced. Second, we proposed a novel Robust Partial Fingerprint (RPF) recognition framework to mitigate the performance degradation due to occlusion. RPF effectively encodes the knowledge about partial fingerprints through occlusion-enhanced data augmentation, and explicitly captures the missing regions for robust feature extraction through occlusion-aware modeling. Finally, we demonstrated the effectiveness of RPF through extensive experiments. Particularly, baseline fingerprint recognition models can degrade the recognition accuracy measured in FRR @ FAR=0.1% from 14.67% to 17.57% at 10% occlusion ratio on the challenging NIST dataset, while RPF instead improves the recognition performance to 9.99% under the same occlusion ratio. Meanwhile, we presented a set of empirical analysis through visual explanation, matching score analysis, and uncertainty modeling, providing insights into the recognition model’s behavior and potential directions of enhancement.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132630537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Unified Approach to Facial Affect Analysis: the MAE-Face Visual Representation","authors":"Bowen Ma, Wei Zhang, Feng Qiu, Yu-qiong Ding","doi":"10.1109/CVPRW59228.2023.00630","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00630","url":null,"abstract":"Facial affect analysis is essential for understanding human expressions and behaviors, encompassing action unit (AU) detection, expression (EXPR) recognition, and valence-arousal (VA) estimation. The CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW) is dedicated to providing a high-quality and large-scale Affwild2 dataset for identifying widely used emotion representations. In this paper, we employ MAE-Face as a unified approach to develop robust visual representations for facial affect analysis. We propose multiple techniques to improve its fine-tuning performance on various downstream tasks, incorporating a two-pass pre-training process and a two-pass fine-tuning process. Our approach exhibits strong results on numerous datasets, highlighting its versatility. Moreover, the proposed model acts as a fundamental component for our final framework in the ABAW5 competition. Our submission achieves outstanding outcomes, ranking first place in the AU and EXPR tracks and second place in the VA track.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134039155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Thermal-RGB Fusion for Robust Object Detection","authors":"Wassim A. El Ahmar, Yahya Massoud, Dhanvin Kolhatkar, Hamzah Alghamdi, Mohammad Al Ja'afreh, R. Laganière, R. Hammoud","doi":"10.1109/CVPRW59228.2023.00042","DOIUrl":"https://doi.org/10.1109/CVPRW59228.2023.00042","url":null,"abstract":"Thermal imaging has seen rapid development in the last few years due to its robustness in different weather and lighting conditions and its reduced production cost. In this paper, we study the performance of different RGB-Thermal fusion methods in the task of object detection, and introduce a new RGB-Thermal fusion approach that enhances the performance by up to 9% using a sigmoid-activated gating mechanism for early fusion. We conduct our experiments on an enhanced version of the City Scene RGB-Thermal MOT Dataset where we register the RGB and corresponding thermal images in order to conduct fusion experiments. Finally, we benchmark the speed of our proposed fusion method and show that it adds negligible overhead to the model processing time. Our work would be useful for autonomous systems and any multi-model machine vision system. The improved version of the dataset, our trained models, and source code are available at https://github.com/wassimea/rgb-thermalfusion.","PeriodicalId":355438,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134543726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}