Emotion Recognition from Video Frame Sequence using Face Mesh and Pre-Trained Models of Convolutional Neural Network

Derry Pramono Adi, E. M. Yuniarno, D. P. Wulandari

2023 International Seminar on Intelligent Technology and Its Applications (ISITIA), 26 July 2023. DOI: 10.1109/ISITIA59021.2023.10221117
Emotions are a collection of subjective cognitive experiences and psychological and physiological characteristics that express a wide range of feelings, thoughts, and behaviors in human interaction. Emotions can be conveyed through several channels, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research has focused on facial expressions, and images of facial expressions are commonly used as input to DL models. Unfortunately, most DL models for Facial Emotion Recognition (FER) use static images, which cannot capture how an expression unfolds over time. A static image of a facial expression is therefore insufficient for recognizing emotion; a sequence of frames from a video is required. In this study, we extract MediaPipe's face mesh feature, a state-of-the-art set of multidimensional expression key points, from each frame of the video sequence, and feed the resulting sequence data into pre-trained Convolutional Neural Network (CNN) models. The data come from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), restricted to the emotion classes "Anger," "Fearful," "Happy," and "Sad." For this FER task, the best pre-trained CNN model, VGG-19, achieved 92.8% accuracy, while the fastest runtime of ~2.3 seconds was achieved by the SqueezeNet model.
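As a rough illustration of the pipeline the abstract describes, the sketch below extracts MediaPipe face-mesh key points frame by frame from a video and loads a pre-trained VGG-19 backbone as the frame-level CNN. The paper does not publish its code, so this is a minimal sketch under stated assumptions: the use of OpenCV and PyTorch/torchvision, the function names, the frame budget, and the file path "clip.mp4" are all illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage pipeline: MediaPipe face mesh for
# per-frame key points, then a pre-trained CNN as a frame feature extractor.
import cv2
import mediapipe as mp
import numpy as np
import torch
from torchvision import models, transforms

mp_face_mesh = mp.solutions.face_mesh


def extract_landmarks(video_path, max_frames=30):
    """Return a (num_frames, 468, 3) array of face-mesh key points."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_face_mesh.FaceMesh(static_image_mode=False,
                               max_num_faces=1) as mesh:
        while cap.isOpened() and len(frames) < max_frames:
            ok, bgr = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV decodes BGR.
            result = mesh.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:
                lm = result.multi_face_landmarks[0].landmark
                frames.append([(p.x, p.y, p.z) for p in lm])
    cap.release()
    return np.asarray(frames, dtype=np.float32)


# Pre-trained VGG-19 (ImageNet weights) as a per-frame backbone; a
# classifier head over the 4 RAVDESS classes would be trained separately.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

landmarks = extract_landmarks("clip.mp4")  # illustrative path
print(landmarks.shape)  # e.g. (30, 468, 3)
```

How the (frames, 468, 3) landmark sequence is mapped into the CNN's expected input (for instance, rendered as mesh images or reshaped into image-like tensors) is a design decision the abstract leaves open; the sketch only shows the two building blocks it names.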