{"title":"Domain Adaptation for Head Pose Estimation Using Relative Pose Consistency","authors":"Felix Kuhnke;Jörn Ostermann","doi":"10.1109/TBIOM.2023.3237039","DOIUrl":"https://doi.org/10.1109/TBIOM.2023.3237039","url":null,"abstract":"Head pose estimation plays a vital role in biometric systems related to facial and human behavior analysis. Typically, neural networks are trained on head pose datasets. Unfortunately, manual or sensor-based annotation of head pose is impractical. A solution is synthetic training data generated from 3D face models, which can provide an infinite number of perfect labels. However, computer generated images only provide an approximation of real-world images, leading to a performance gap between training and application domain. Therefore, there is a need for strategies that allow simultaneous learning on labeled synthetic data and unlabeled real-world data to overcome the domain gap. In this work we propose relative pose consistency, a semi-supervised learning strategy for head pose estimation based on consistency regularization. Consistency regularization enforces consistent network predictions under random image augmentations, including pose-preserving and pose-altering augmentations. We propose a strategy to exploit the relative pose introduced by pose-altering augmentations between augmented image pairs, to allow the network to benefit from relative pose labels during training on unlabeled data. We evaluate our approach in a domain-adaptation scenario and in a commonly used cross-dataset scenario. Furthermore, we reproduce related works to enforce consistent evaluation protocols and show that for both scenarios we outperform SOTA.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 3","pages":"348-359"},"PeriodicalIF":0.0,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8423754/10210132/10021684.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RGB-D Face Recognition With Identity-Style Disentanglement and Depth Augmentation","authors":"Meng-Tzu Chiu;Hsun-Ying Cheng;Chien-Yi Wang;Shang-Hong Lai","doi":"10.1109/TBIOM.2022.3233769","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3233769","url":null,"abstract":"Deep learning approaches achieve highly accurate face recognition by training the models with huge face image datasets. Unlike 2D face image datasets, there is a lack of large 3D face datasets available to the public. Existing public 3D face datasets were usually collected with few subjects, leading to the over-fitting problem. This paper proposes two CNN models to improve the RGB-D face recognition task. The first is a segmentation-aware depth estimation network, called DepthNet, which estimates depth maps from RGB face images by exploiting semantic segmentation for more accurate face region localization. The other is a novel segmentation-guided RGB-D face recognition model that contains an RGB recognition branch, a depth map recognition branch, and an auxiliary segmentation mask branch. In our multi-modality face recognition model, a feature disentanglement scheme is employed to factorize the feature representation into identity-related and style-related components. DepthNet is applied to augment a large 2D face image dataset to a large RGB-D face dataset, which is used for training our RGB-D face recognition model. Our experimental results show that DepthNet can produce more reliable depth maps from face images with the segmentation mask. Our multi-modality face recognition model fully exploits the depth map and outperforms state-of-the-art methods on several public 3D face datasets with challenging variations.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 3","pages":"334-347"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention","authors":"R. Gnana Praveen;Patrick Cardinal;Eric Granger","doi":"10.1109/TBIOM.2022.3233083","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3233083","url":null,"abstract":"Automatic emotion recognition (ER) has recently gained much interest due to its potential in many real-world applications. In this context, multimodal approaches have been shown to improve performance (over unimodal approaches) by combining diverse and complementary sources of information, providing some robustness to noisy and missing modalities. In this paper, we focus on dimensional ER based on the fusion of facial and vocal modalities extracted from videos, where complementary audio-visual (A-V) relationships are explored to predict an individual’s emotional states in valence-arousal space. Most state-of-the-art fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. To address this problem, we introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities, and allows to effectively leverage the inter-modal relationships, while retaining the intra-modal relationships. In particular, it computes the cross-attention weights based on correlation between the joint feature representation and that of individual modalities. Deploying the joint A-V feature representation into the cross-attention module helps to simultaneously leverage both the intra and inter modal relationships, thereby significantly improving the performance of the system over the vanilla cross-attention module. The effectiveness of our proposed approach is validated experimentally on challenging videos from the RECOLA and AffWild2 datasets. Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches, even when the modalities are noisy or absent. Code is available at \u0000<uri>https://github.com/praveena2j/Joint-Cross-Attention-for-Audio-Visual-Fusion</uri>\u0000.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 3","pages":"360-373"},"PeriodicalIF":0.0,"publicationDate":"2023-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Scale Spatio-Temporal Network for Violence Behavior Detection","authors":"Wei Zhou;Xuanlin Min;Yiheng Zhao;Yiran Pang;Jun Yi","doi":"10.1109/TBIOM.2022.3233399","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3233399","url":null,"abstract":"Violence behavior detection has played an important role in computer vision, its widely used in unmanned security monitoring systems, Internet video filtration, etc. However, automatically detecting violence behavior from surveillance cameras has long been a challenging issue due to the real-time and detection accuracy. In this brief, a novel multi-scale spatio-temporal network termed as MSTN is proposed to detect violence behavior from video stream. To begin with, the spatio-temporal feature extraction module (STM) is developed to extract the key features between foreground and background of the original video. Then, temporal pooling and cross channel pooling are designed to obtain short frame rate and long frame rate from STM, respectively. Furthermore, short-time building (STB) branch and long-time building (LTB) branch are presented to extract the violence features from different spatio-temporal scales, where STB module is used to capture the spatial feature and LTB module is used to extract useful temporal feature for video recognition. Finally, a Trans module is presented to fuse the features of STB and LTB through lateral connection operation, where LTB feature is compressed into STB to improve the accuracy. Experimental results show the effectiveness and superiority of the proposed method on computational efficiency and detection accuracy.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 2","pages":"266-276"},"PeriodicalIF":0.0,"publicationDate":"2023-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Biometrics, Behavior, and Identity Science Publication Information","authors":"","doi":"10.1109/TBIOM.2022.3226338","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3226338","url":null,"abstract":"Presents a listing of the editorial board, board of governors, current staff, committee members, and/or society editors for this issue of the publication.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 1","pages":"C2-C2"},"PeriodicalIF":0.0,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8423754/9997805/09997808.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49950246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IEEE Transactions on Biometrics, Behavior, and Identity Science Information for Authors","authors":"","doi":"10.1109/TBIOM.2022.3226339","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3226339","url":null,"abstract":"These instructions give guidelines for preparing papers for this publication. Presents information for authors publishing in this journal.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 1","pages":"C3-C3"},"PeriodicalIF":0.0,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8423754/9997805/09997806.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49950184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Context Grouped Attention for Unsupervised Person Re-Identification","authors":"Kshitij Nikhal;Benjamin S. Riggan","doi":"10.1109/TBIOM.2022.3226678","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3226678","url":null,"abstract":"Recent advancements like multiple contextual analysis, attention mechanisms, distance-aware optimization, and multi-task guidance have been widely used for supervised person re-identification (ReID), but the implementation and effects of such methods in unsupervised ReID frameworks are non-trivial and unclear, respectively. Moreover, with increasing size and complexity of image- and video-based ReID datasets, manual or semi-automated annotation procedures for supervised ReID are becoming labor intensive and cost prohibitive, which is undesirable especially considering the likelihood of annotation errors increase with scale/complexity of data collections. Therefore, we propose a new iterative clustering framework that is insensitive to annotation errors and over-fitting ReID annotations (i.e., labels). Our proposed unsupervised framework incorporates (a) a novel multi-context group attention architecture that learns a holistic attention map from multiple local and global contexts, (b) an unsupervised clustering loss function that down-weights easily discriminative identities, and (c) a background diversity term that helps cluster persons across different cross-camera views without leveraging any identification or camera labels. We perform extensive analysis using the DukeMTMC-VideoReID and MARS video-based ReID datasets and the MSMT17 image-based ReID dataset. Our approach is shown to provide a new state-of-the-art performance for unsupervised ReID, reducing the rank-1 performance gap between supervised and unsupervised ReID to 1.1%, 12.1%, and 21.9% from 6.1%, 17.9%, and 22.6% for DukeMTMC, MARS, and MSMT17 datasets, respectively.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 2","pages":"170-182"},"PeriodicalIF":0.0,"publicationDate":"2022-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49964209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-Sample Finger Vein Recognition via Competitive and Progressive Sparse Representation","authors":"Pengyang Zhao;Zhiquan Chen;Jing-Hao Xue;Jianjiang Feng;Wenming Yang;Qingmin Liao;Jie Zhou","doi":"10.1109/TBIOM.2022.3226270","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3226270","url":null,"abstract":"As an emerging biometric technology, finger vein recognition has attracted much attention in recent years. However, single-sample recognition is a practical and longstanding challenge in this field, referring to only one finger vein image per class in the training set. In single-sample finger vein recognition, the illumination variations under low contrast and the lack of information of intra-class variations severely affect the recognition performance. Despite of its high robustness against noise and illumination variations, sparse representation has rarely been explored for single-sample finger vein recognition. Therefore, in this paper, we focus on developing a new approach called Progressive Sparse Representation Classification (PSRC) to address the challenging issue of single-sample finger vein recognition. Firstly, as residual may become too large under the scenario of single-sample finger vein recognition, we propose a progressive strategy for representation refinement of SRC. Secondly, to adaptively optimize progressions, a progressive index called Max Energy Residual Index (MERI) is defined as the guidance. Furthermore, we extend PSRC to bimodal biometrics and propose a Competitive PSRC (C-PSRC) fusion approach. The C-PSRC creates more discriminative fused sample and fusion dictionary by comparing residual errors of different modalities. By comparing with several state-of-the-art methods on three finger vein benchmarks, the superiority of the proposed PSRC and C-PSRC is clearly demonstrated.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 2","pages":"209-220"},"PeriodicalIF":0.0,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49964208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iGROWL: Improved Group Detection With Link Prediction","authors":"Viktor Schmuck;Oya Celiktutan","doi":"10.1109/TBIOM.2022.3225654","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3225654","url":null,"abstract":"One of the main challenges robots need to overcome is crowd analysis. Crowd analysis deals with the detection of individuals and interaction groups as well as the recognition of their activities. This paper focuses on the detection of conversational groups, where there have been a number of approaches addressing this problem in both supervised and unsupervised ways. Supervised bottom-up approaches primarily relied on pairwise affinity matrices and were limited to static, third-person views. In this work, we present our approach based on Graph Neural Networks (GNNs) to the problem of interaction group detection, called improved Group Detection With Link Prediction (iGROWL). iGROWL utilises the fact that interaction groups exist in certain inherent spatial configurations and improves its predecessor, GROWL, by introducing an ensemble learning-based sample balancing technique to the algorithm. Our results show that iGROWL outperforms other state-of-the-art methods by 16.7% and 26.4% in terms of \u0000<inline-formula> <tex-math>$F_{1}$ </tex-math></inline-formula>\u0000-score when evaluated on the Salsa Poster Session and Cocktail Party datasets, respectively. Moreover, we show that sample balancing with GNNs is not trivial, but consistent results can be achieved by employing ensemble learning.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 3","pages":"400-410"},"PeriodicalIF":0.0,"publicationDate":"2022-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating 2-D and 3-D Master Faces for Dictionary Attacks With a Network-Assisted Latent Space Evolution","authors":"Tomer Friedlander;Ron Shmelkin;Lior Wolf","doi":"10.1109/TBIOM.2022.3223738","DOIUrl":"https://doi.org/10.1109/TBIOM.2022.3223738","url":null,"abstract":"A master face is a face image that passes face-based identity authentication for a high percentage of the population. These faces can be used to impersonate, with a high probability of success, any user, without having access to any user information. We optimize these faces for 2D and 3D face verification models, by using an evolutionary algorithm in the latent embedding space of the StyleGAN face generator. For 2D face verification, multiple evolutionary strategies are compared, and we propose a novel approach that employs a neural network to direct the search toward promising samples, without adding fitness evaluations. The results we present demonstrate that it is possible to obtain a considerable coverage of the identities in the LFW or RFW datasets with less than 10 master faces, for six leading deep face recognition systems. In 3D, we generate faces using the 2D StyleGAN2 generator and predict a 3D structure using a deep 3D face reconstruction network. When employing two different 3D face recognition systems, we are able to obtain a coverage of 40%-50%. Additionally, we present the generation of paired 2D RGB and 3D master faces, which simultaneously match 2D and 3D models with high impersonation rates.","PeriodicalId":73307,"journal":{"name":"IEEE transactions on biometrics, behavior, and identity science","volume":"5 3","pages":"385-399"},"PeriodicalIF":0.0,"publicationDate":"2022-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49989778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}