Zheng Sun, Andrew W. Sumsion, Shad A. Torrie, Dah-Jye Lee
{"title":"Learn Dynamic Facial Motion Representations Using Transformer Encoder","authors":"Zheng Sun, Andrew W. Sumsion, Shad A. Torrie, Dah-Jye Lee","doi":"10.1109/ietc54973.2022.9796917","DOIUrl":null,"url":null,"abstract":"Human face analysis is an essential topic in visual computing. Many of our daily applications, such as face-priority auto focus in camera, face-based identity verification, and TikTok stickers, are unattainable without face analysis techniques. In the past ten years, face-related visual computing tasks like face detection, face recognition, and facial expression classification have improved drastically in performance, benefiting from the rapid development of deep learning theory. This work explores how to model dynamic facial motion using a learning-based method. Our proposed model takes video clips containing customized facial motion as input and generates a uni-size vector (the embedding) as the output. We have inspected two different encoders–recurrent neural networks and transformers to extract the temporal features from the video clip. We collected our own facial motion analysis dataset because there is no suitable datasets for our facial motion analysis task. Although our domain-specific dataset is small compared to the well-known public datasets for ordinary face-related tasks, we adopt a transfer learning approach, and a data augmentation method (random trimming) to help the model converge. The experimental results show that the transformer-based encoder performs better than the RNN baseline, and the best F1-score with our validation data is 0.889.","PeriodicalId":251518,"journal":{"name":"2022 Intermountain Engineering, Technology and Computing (IETC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Intermountain Engineering, Technology and Computing (IETC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ietc54973.2022.9796917","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Human face analysis is an essential topic in visual computing. Many of our daily applications, such as face-priority autofocus in cameras, face-based identity verification, and TikTok stickers, are unattainable without face analysis techniques. In the past ten years, face-related visual computing tasks such as face detection, face recognition, and facial expression classification have improved drastically in performance, benefiting from the rapid development of deep learning theory. This work explores how to model dynamic facial motion using a learning-based method. Our proposed model takes video clips containing customized facial motion as input and generates a fixed-size vector (the embedding) as the output. We have examined two different encoders, recurrent neural networks and transformers, to extract temporal features from the video clip. We collected our own facial motion analysis dataset because there is no suitable dataset for our facial motion analysis task. Although our domain-specific dataset is small compared to the well-known public datasets for ordinary face-related tasks, we adopt a transfer learning approach and a data augmentation method (random trimming) to help the model converge. The experimental results show that the transformer-based encoder performs better than the RNN baseline, and the best F1-score on our validation data is 0.889.
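To make the architecture described above concrete, the sketch below shows one plausible way a transformer encoder can map a variable-length sequence of per-frame face features to a single fixed-size clip embedding, together with a hypothetical reading of the "random trimming" augmentation. This is a minimal illustration under assumed layer sizes and [CLS]-token pooling, not the authors' implementation.

```python
# Minimal sketch (assumptions: per-frame features come from a pretrained face CNN,
# pooling uses a learnable [CLS] token, and all hyperparameters are illustrative).
import torch
import torch.nn as nn


class ClipEncoder(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256, num_layers=4, num_heads=8, max_len=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)              # project frame features
        self.cls = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, embed_dim))  # positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim)
        x = self.proj(frame_feats)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)
        x = x + self.pos[:, : x.size(1)]
        x = self.encoder(x)
        return x[:, 0]  # the [CLS] output serves as the fixed-size clip embedding


def random_trim(frames, min_len=16):
    """Augmentation sketch: keep a random contiguous sub-clip of the frame sequence
    (a hypothetical interpretation of the paper's 'random trimming')."""
    t = frames.size(0)
    length = torch.randint(min_len, t + 1, (1,)).item()
    start = torch.randint(0, t - length + 1, (1,)).item()
    return frames[start:start + length]


if __name__ == "__main__":
    clip = torch.randn(1, 32, 512)        # one clip: 32 frames of 512-d features
    print(ClipEncoder()(clip).shape)      # torch.Size([1, 256])
```

In this reading, the resulting embedding would feed a small classification head for the facial-motion labels, which is how an F1-score such as the one reported could be computed on validation clips.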