Emotion Recognition from Video Frame Sequence using Face Mesh and Pre-Trained Models of Convolutional Neural Network

Derry Pramono Adi, E. M. Yuniarno, D. P. Wulandari
{"title":"基于人脸网格和卷积神经网络预训练模型的视频帧序列情感识别","authors":"Derry Pramono Adi, E. M. Yuniarno, D. P. Wulandari","doi":"10.1109/ISITIA59021.2023.10221117","DOIUrl":null,"url":null,"abstract":"Emotions are a collection of subjective cognitive experiences and psychological and physiological characteristics that express a wide range of feelings, thoughts, and behaviors in human interaction. Emotions can be represented through several means, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research has focused on incorporating facial expressions. Images with facial expressions are commonly used as data input for the DL model. Unfortunately, most DL models in Facial Emotion Recognition (FER) use static images. This method does not take into consideration all conceivable facial expressions. The static image of facial expressions is insufficient for recognizing emotions, but a sequential image from a video is required. In this study, we extract MediaPipe’s face mesh feature, the state-of-the-art multidimensional expression key points embedded in the video image sequence. Furthermore, we feed sequence image data into the pre-trained Convolutional Neural Network (CNN) model. The data we used is from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with the emotion classes of “Anger,” “Fearful,” “Happy,” and “Sad.” For this specific FER task, we found that the best pre-trained CNN model achieved 92.8% accuracy (using the VGG-19 model), with the fastest runtime of $\\sim2.3$ seconds (achieved using the SqueezeNet model).","PeriodicalId":116682,"journal":{"name":"2023 International Seminar on Intelligent Technology and Its Applications (ISITIA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Emotion Recognition from Video Frame Sequence using Face Mesh and Pre-Trained Models of Convolutional Neural Network\",\"authors\":\"Derry Pramono Adi, E. M. Yuniarno, D. P. Wulandari\",\"doi\":\"10.1109/ISITIA59021.2023.10221117\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emotions are a collection of subjective cognitive experiences and psychological and physiological characteristics that express a wide range of feelings, thoughts, and behaviors in human interaction. Emotions can be represented through several means, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research has focused on incorporating facial expressions. Images with facial expressions are commonly used as data input for the DL model. Unfortunately, most DL models in Facial Emotion Recognition (FER) use static images. This method does not take into consideration all conceivable facial expressions. The static image of facial expressions is insufficient for recognizing emotions, but a sequential image from a video is required. In this study, we extract MediaPipe’s face mesh feature, the state-of-the-art multidimensional expression key points embedded in the video image sequence. Furthermore, we feed sequence image data into the pre-trained Convolutional Neural Network (CNN) model. 
The data we used is from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) with the emotion classes of “Anger,” “Fearful,” “Happy,” and “Sad.” For this specific FER task, we found that the best pre-trained CNN model achieved 92.8% accuracy (using the VGG-19 model), with the fastest runtime of $\\\\sim2.3$ seconds (achieved using the SqueezeNet model).\",\"PeriodicalId\":116682,\"journal\":{\"name\":\"2023 International Seminar on Intelligent Technology and Its Applications (ISITIA)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Seminar on Intelligent Technology and Its Applications (ISITIA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISITIA59021.2023.10221117\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Seminar on Intelligent Technology and Its Applications (ISITIA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISITIA59021.2023.10221117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Emotions are a collection of subjective cognitive experiences and psychological and physiological characteristics that express a wide range of feelings, thoughts, and behaviors in human interaction. They can be conveyed through several channels, such as facial expressions, tone of voice, and behavior. Deep Learning (DL) research has focused on incorporating facial expressions, and images of facial expressions are commonly used as input data for DL models. Unfortunately, most DL models for Facial Emotion Recognition (FER) use static images, an approach that cannot capture all conceivable facial expressions. A static image of a facial expression is insufficient for recognizing emotion; a sequence of images from a video is required. In this study, we extract MediaPipe’s face mesh feature, a state-of-the-art set of multidimensional expression key points, from each frame of the video image sequence, and we feed the sequential image data into pre-trained Convolutional Neural Network (CNN) models. The data we used is from The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), with the emotion classes “Anger,” “Fearful,” “Happy,” and “Sad.” For this specific FER task, the best pre-trained CNN model achieved 92.8% accuracy (VGG-19), while the fastest runtime was $\sim2.3$ seconds (SqueezeNet).
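For concreteness, the sketch below illustrates the two stages the abstract describes: extracting MediaPipe’s 468-point face mesh from each video frame, and adapting a pre-trained CNN (here torchvision’s VGG-19) to the four RAVDESS emotion classes. This is a minimal sketch, not the authors’ released code: the function name, frame count, and classifier head are illustrative assumptions, since the abstract does not specify how the mesh features are arranged before the CNN.

```python
# Minimal sketch (illustrative, not the authors' pipeline): per-frame face
# mesh extraction with MediaPipe, plus a pre-trained VGG-19 whose final
# layer is replaced for four emotion classes. Assumes the mediapipe,
# opencv-python, numpy, torch, and torchvision packages are installed.
import cv2
import mediapipe as mp
import numpy as np
import torch.nn as nn
from torchvision import models

def extract_face_mesh_sequence(video_path, max_frames=30):
    """Return a (frames, 468, 3) array of face mesh landmarks from a video."""
    cap = cv2.VideoCapture(video_path)
    sequence = []
    with mp.solutions.face_mesh.FaceMesh(
            static_image_mode=False,  # track the face across frames
            max_num_faces=1) as face_mesh:
        while len(sequence) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:
                pts = result.multi_face_landmarks[0].landmark
                # Normalized x, y, z coordinates of the 468 mesh key points.
                sequence.append(np.array([[p.x, p.y, p.z] for p in pts]))
    cap.release()
    return np.stack(sequence) if sequence else None

# Pre-trained VGG-19 backbone with its 1000-way ImageNet head swapped for
# the four emotion classes ("Anger", "Fearful", "Happy", "Sad").
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 4)
```

How the landmark sequence is turned into CNN input (for example, rasterizing the mesh per frame or stacking coordinates into an image-like tensor) is a design choice the abstract leaves open.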