{"title":"Intrinsic Imaging Model Enhanced Contrastive Face Representation Learning","authors":"Haomiao Sun, S. Shan, Hu Han","doi":"10.1109/FG57933.2023.10042802","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042802","url":null,"abstract":"Humans can easily perceive a wealth of information from faces, only part of which has been achieved by a machine, thanks to the availability of large-scale face images with supervision signals of those specific tasks. More face perception tasks, like rare expression or attribute recognition, and genetic syndrome diagnosis, are not solved due to a critical shortage of supervised data. One possible way to solve these tasks is to leverage ubiquitous large-scale unsupervised face images and build a foundation face model via methods like contrastive learning (CL), which is, however, not aware of the intrinsic physics of the human face. In consideration of this shortcoming, this paper proposes to enhance contrastive face representation learning by the physical imaging model. Specifically, besides the CL-backbone network, we also design an auxiliary bypass pathway to constrain the CL-backbone to support the ability of accurately re-rendering the face with a differentiable physical imaging model after decomposing an input face image into intrinsic 3D imaging factors. With this design, the CL network is endowed with the capacity of implicitly “knowing” the 3D of the face rather than the 2D pixels only. In experiments, we learn face representations from the CelebA and WebFace-42M datasets in unsupervised mode and evaluate the generalization capability of the representations with three different downstream tasks in the case of limited supervised data. 
The experimental results clearly justify the effectiveness of the proposed method.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122256128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchically Organized Computer Vision in Support of Multi-Faceted Search for Missing Persons","authors":"Arturo Miguel Russell Bernal, Jane Cleland-Huang","doi":"10.1109/FG57933.2023.10042698","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042698","url":null,"abstract":"Missing person searches are typically initiated with a description of a person that includes their age, race, clothing, and gender, possibly supported by a photo. Small Unmanned Aerial Systems (sUAS) imbued with Computer Vision (CV) capabilities can be deployed to quickly search an area to find the missing person; however, the search task is far more difficult when a crowd of people is present, and only the person described in the missing person report must be identified. It is particularly challenging to perform this task on the potentially limited resources of an sUAS. We therefore propose AirSight, a new model that hierarchically combines multiple CV models, exploits both onboard and off-board computing capabilities, and engages humans interactively in the search. For illustrative purposes, we use AirSight to show how a person's image, extracted from an aerial video, can be matched to a basic description of the person. 
Finally, as a work-in-progress paper, we describe ongoing efforts in building an aerial dataset of partially occluded people and physically deploying AirSight on our sUAS.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124725349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Interactions in Paired Egocentric Videos","authors":"A. Khatri, Zachary Butler, Ifeoma Nwogu","doi":"10.1109/FG57933.2023.10042654","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042654","url":null,"abstract":"As wearable devices become more popular, egocentric information recorded with these devices can be used to better understand the behaviors of the wearer and other people the wearer is interacting with. Data such as the voice, head movement, galvanic skin responses (GSR) to measure arousal levels, etc., obtained from such devices can provide a window into the underlying affect of both the wearer and his/her conversant. In this study, we examine the characteristics of two types of dyadic conversations. In one case, the interlocutors discuss a topic on which they agree, while the other situation involves interlocutors discussing a topic on which they disagree, even if they are friends. The range of topics is mostly politically motivated. The egocentric information is collected using a pair of wearable smart glasses for video data and a smart wristband for physiological data, including GSR. Using this data, various features are extracted, including the facial expressions of the conversant and the 3D motion of the wearer's camera within the environment; this motion is termed egomotion. The goal of this work is to investigate whether the nature of a discussion could be better determined either by evaluating the behavior of an individual in the conversation or by evaluating the pairing/coupling of the behaviors of the two people in the conversation. The pairing is accomplished using a modified formulation of the dynamic time warping (DTW) algorithm. A random forest classifier is implemented to evaluate the nature of the interaction (agreement versus disagreement) using individualistic and paired features separately. 
The study found that in the presence of the limited data used in this work, individual behaviors were slightly more indicative of the type of discussion (85.43% accuracy) than the paired behaviors (83.33% accuracy).","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124240050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to focus on region-of-interests for pain intensity estimation","authors":"Manh-Tu Vu, M. Beurton-Aimar","doi":"10.1109/FG57933.2023.10042583","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042583","url":null,"abstract":"The breakthrough success of many deep learning approaches is mainly due to the availability of large-scale labeled datasets. However, large-scale labeled datasets are not available in every domain. Pain intensity estimation is, unsurprisingly, one of those domains that suffer from a lack of labeled training data. In this work, we propose a learning approach that learns to focus on regions of interest in face images for better feature extraction, thus improving the overall performance of the network when training on a limited amount of data. Our extensive experiments demonstrate that this approach performs better overall than state-of-the-art approaches to pain intensity estimation. The experimental results underline the importance of learning to focus on regions of interest for extracting better feature representations and reducing overfitting when training on a limited amount of data.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126532489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Continuous Mesh Representation with Spherical Implicit Surface","authors":"Zhong Gao","doi":"10.1109/FG57933.2023.10042514","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042514","url":null,"abstract":"As the most common representation for 3D shapes, a mesh is often stored discretely as arrays of vertices and faces. However, 3D shapes in the real world are continuous. In this paper, we propose to learn a continuous representation for meshes with fixed topology, a common and practical setting in many face-, hand-, and body-related applications. First, we split the template into multiple closed manifold genus-0 meshes so that each genus-0 mesh can be parameterized onto the unit sphere. Then we learn a spherical implicit surface (SIS), which takes a spherical coordinate and a global feature or a set of local features around the coordinate as inputs and predicts the vertex corresponding to the coordinate as an output. Since the spherical coordinates are continuous, SIS can depict a mesh at an arbitrary resolution. The SIS representation builds a bridge between discrete and continuous representations of 3D shapes. Specifically, we train SIS networks in a self-supervised manner for two tasks: a reconstruction task and a super-resolution task. 
Experiments show that our SIS representation is comparable with state-of-the-art methods that are specifically designed for meshes with a fixed resolution and significantly outperforms methods that work in arbitrary resolutions.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123586900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"STr-GCN: Dual Spatial Graph Convolutional Network and Transformer Graph Encoder for 3D Hand Gesture Recognition","authors":"Rim Slama, W. Rabah, H. Wannous","doi":"10.1109/FG57933.2023.10042643","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042643","url":null,"abstract":"Skeleton-based hand gesture recognition is a challenging task that has attracted a lot of attention in recent years, especially with the rise of Graph Neural Networks. In this paper, we propose a new deep learning architecture for hand gesture recognition using 3D hand skeleton data, which we call STr-GCN. It decouples the spatial and temporal learning of the gesture by leveraging Graph Convolutional Networks (GCN) and Transformers. The key idea is to combine two powerful networks: a Spatial Graph Convolutional Network unit that understands intra-frame interactions to extract powerful features from different hand joints and a Transformer Graph Encoder which is based on a Temporal Self-Attention module to incorporate inter-frame correlations. We evaluate the performance of our method on three benchmarks: the SHREC'17 Track dataset, the Briareo dataset and the First Person Hand Action dataset. The experiments show the efficiency of our approach, which matches or outperforms the state of the art. The code to reproduce our results is available in this link.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121342161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Mental Prototypes by an Efficient Interdisciplinary Approach: Interactive Microbial Genetic Algorithm","authors":"Sen Yan, Catherine Soladié, R. Séguier","doi":"10.1109/FG57933.2023.10042515","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042515","url":null,"abstract":"Facial expression-based technologies have flooded our daily lives. However, most technologies are limited to Ekman's basic facial expressions and rarely deal with more than ten emotional states. This is not only due to the lack of prototypes for complex emotions but also the time-consuming and laborious task of building an extensive labeled database. To remove these obstacles, we were inspired by a psychophysical approach for affective computing, the so-called reverse correlation process (RevCor), to extract mental prototypes of what a given emotion should look like for an observer. We propose a novel, efficient, and interdisciplinary approach called Interactive Microbial Genetic Algorithm (IMGA) by integrating the concepts of RevCor into an interactive genetic algorithm (IGA). Our approach addresses four challenges: online feedback loop, expertise-free, velocity, and diverse results. Experimental results show that for each observer, with limited trials, our approach can provide diverse mental prototypes for both basic emotions and emotions that are not available in existing deep-learning databases. 
Our work is available at https://yansen0508.github.io/Interactive-Microbial-Genetic-Algorithm/.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126406740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-supervised Learning for Fine-grained Ethnicity Classification under Limited Labeled Data","authors":"Kun Li, Jie Zhang, S. Shan","doi":"10.1109/FG57933.2023.10042748","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042748","url":null,"abstract":"Human faces are always determined by genes and other external causes, such as geographical environment, which makes it possible for us to predict ethnicity according to the faces. However, it remains a challenging task due to the tiny differences in faces for various ethnicities, which are hard for human beings to tell apart, especially for ethnicities on the same continent, e.g., East Asia. Although some strongly-supervised methods have demonstrated their feasibility in this task, they cease to be effective when suffering from data-hungry issues in practice. This paper proposes a novel self-supervised model with a polynomial stacked attention mechanism to better capture distinctions across different nations under limited labeled data. We also construct a new ethnicity dataset named Cupid, which noticeably extends the scale and categories of ethnic data compared to the existing datasets. Extensive experiments confirm that our method achieves state-of-the-art results on both the Asian Face dataset and our proposed Cupid dataset.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128390114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Resolution Face Recognition Enhanced by High-Resolution Facial Images","authors":"Haihan Wang, Shangfei Wang","doi":"10.1109/FG57933.2023.10042552","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042552","url":null,"abstract":"Despite recent advances in high-resolution (HR) face recognition, recognizing identities from low-resolution (LR) facial images remains challenging due to the absence of facial shape and detail. Current research focuses solely on reducing the distribution discrepancy between the HR and LR embeddings from the output layer, rather than thoroughly investigating the superiority of HR facial images for improved performance. In this paper, we propose a novel low-resolution face recognition method enhanced by the guidance of high-resolution facial images in both feature map space and embedding space. Specifically, in feature map space, a similarity constraint across the multilayer feature maps is adopted to align the intermediate features of facial images. Then we introduce multiple generators to recover HR images from extracted feature maps and utilize the reconstruction loss to supplement the missing facial details in LR images. In embedding space, we propose a supervised auxiliary contrastive loss to encourage paired HR and LR embeddings from the same class to be pulled together, whereas those from different classes are pushed apart. A one-to-many matching strategy and an adaptive weight adjustment strategy are applied to make the network adapt to inputs of different resolutions. 
Experiments on four benchmark datasets with both synthesized and realistic LR facial images demonstrate the superiority of the proposed method over state-of-the-art methods.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130687852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning from What is Already Out There: Few-shot Sign Language Recognition with Online Dictionaries","authors":"Matyáš Boháček, M. Hrúz","doi":"10.1109/FG57933.2023.10042544","DOIUrl":"https://doi.org/10.1109/FG57933.2023.10042544","url":null,"abstract":"Today's sign language recognition models require large training corpora of laboratory-like videos, whose collection involves an extensive workforce and financial resources. As a result, only a handful of such systems are publicly available, not to mention their limited localization capabilities for less-populated sign languages. Utilizing online text-to-video dictionaries, which inherently hold annotated data of various attributes and sign languages, and training models in a few-shot fashion hence poses a promising path for the democratization of this technology. In this work, we collect and open-source the UWB-SL-Wild few-shot dataset, the first of its kind training resource consisting of dictionary-scraped videos. This dataset represents the actual distribution and characteristics of available online sign language data. We select glosses that directly overlap with the already existing datasets WLASL100 and ASLLVD and share their class mappings to allow for transfer learning experiments. 
Apart from providing baseline results on a pose-based architecture, we introduce a novel approach to training sign language recognition models in a few-shot scenario, resulting in state-of-the-art results on ASLLVD-Skeleton and ASLLVD-Skeleton-20 datasets with top-1 accuracy of 30.97% and 95.45%, respectively.","PeriodicalId":318766,"journal":{"name":"2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)","volume":"47 9","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132511371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}