Estimating the Intensity of Facial Expressions Accompanying Feedback Responses in Multiparty Video-Mediated Communication

Ryosuke Ueno, Y. Nakano, Jie Zeng, Fumio Nihei
{"title":"Estimating the Intensity of Facial Expressions Accompanying Feedback Responses in Multiparty Video-Mediated Communication","authors":"Ryosuke Ueno, Y. Nakano, Jie Zeng, Fumio Nihei","doi":"10.1145/3382507.3418878","DOIUrl":null,"url":null,"abstract":"Providing feedback to a speaker is an essential communication signal for maintaining a conversation. In specific feedback, which indicates the listener's reaction to the speaker?s utterances, the facial expression is an effective modality for conveying the listener's reactions. Moreover, not only the type of facial expressions, but also the degree of intensity of the expressions, may influence the meaning of the specific feedback. In this study, we propose a multimodal deep neural network model that predicts the intensity of facial expressions co-occurring with feedback responses. We focus on multiparty video-mediated communication. In video-mediated communication, close-up frontal face images of each participant are continuously presented on the display; the attention of the participants is more likely to be drawn to the facial expressions. We assume that in such communication, the importance of facial expression in the listeners? feedback responses increases. We collected 33 video-mediated conversations by groups of three people and obtained audio and speech data for each participant. Using the corpus collected as a dataset, we created a deep neural network model that predicts the intensity of 17 types of action units (AUs) co-occurring with the feedback responses. The proposed method employed GRU-based model with attention mechanism for audio, visual, and language modalities. A decoder was trained to produce the intensity values for the 17 AUs frame by frame. In the experiment, unimodal and multimodal models were compared in terms of their performance in predicting salient AUs that characterize facial expression in feedback responses. The results suggest that well-performing models differ depending on the AU categories; audio information was useful for predicting AUs that express happiness, and visual and language information contributes to predicting AUs expressing sadness and disgust.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3382507.3418878","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Providing feedback to a speaker is an essential communication signal for maintaining a conversation. In specific feedback, which indicates the listener's reaction to the speaker's utterances, facial expression is an effective modality for conveying that reaction. Moreover, not only the type of facial expression but also its intensity may influence the meaning of the specific feedback. In this study, we propose a multimodal deep neural network model that predicts the intensity of facial expressions co-occurring with feedback responses. We focus on multiparty video-mediated communication, in which close-up frontal face images of each participant are continuously presented on the display, so the participants' attention is more likely to be drawn to facial expressions. We assume that in such communication, the importance of facial expressions in the listeners' feedback responses increases. We collected 33 video-mediated conversations between groups of three people and obtained audio and speech data for each participant. Using the collected corpus as a dataset, we created a deep neural network model that predicts the intensity of 17 types of action units (AUs) co-occurring with feedback responses. The proposed method employs a GRU-based model with an attention mechanism for the audio, visual, and language modalities, and a decoder is trained to produce the intensity values of the 17 AUs frame by frame. In the experiment, unimodal and multimodal models were compared in terms of their performance in predicting salient AUs that characterize facial expressions in feedback responses. The results suggest that the best-performing models differ depending on the AU category: audio information was useful for predicting AUs that express happiness, while visual and language information contributed to predicting AUs expressing sadness and disgust.
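
The abstract outlines the architecture only at a high level: modality-specific GRU encoders with attention over audio, visual, and language inputs, fused and decoded into frame-wise intensities for 17 AUs. The sketch below is not the authors' code; it illustrates one plausible way to realize such a model in PyTorch, and the feature dimensions, the simple additive attention, and the late-fusion scheme are assumptions made purely for illustration.

# Minimal sketch of a multimodal GRU + attention encoder with a frame-wise
# AU-intensity decoder, assuming PyTorch. All sizes are illustrative.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """GRU encoder with additive attention pooling over its own time steps."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, input_dim)
        h, _ = self.gru(x)                            # (batch, time, hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time
        return h, (weights * h).sum(dim=1)            # sequence + pooled context


class AUIntensityModel(nn.Module):
    """Fuses audio, visual, and language encodings and decodes intensities
    for 17 AUs at every output frame with a GRU decoder."""

    def __init__(self, audio_dim=40, visual_dim=35, lang_dim=300,
                 hidden_dim=128, num_aus=17):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, hidden_dim)
        self.visual_enc = ModalityEncoder(visual_dim, hidden_dim)
        self.lang_enc = ModalityEncoder(lang_dim, hidden_dim)
        self.decoder = nn.GRU(hidden_dim * 3, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_aus)

    def forward(self, audio, visual, lang, num_frames: int):
        _, a = self.audio_enc(audio)
        _, v = self.visual_enc(visual)
        _, l = self.lang_enc(lang)
        context = torch.cat([a, v, l], dim=-1)        # late fusion (assumption)
        # Feed the fused context to the decoder at each output frame.
        dec_in = context.unsqueeze(1).repeat(1, num_frames, 1)
        h, _ = self.decoder(dec_in)
        return self.out(h)                            # (batch, frames, 17)


if __name__ == "__main__":
    model = AUIntensityModel()
    audio = torch.randn(2, 100, 40)   # e.g. acoustic feature frames
    visual = torch.randn(2, 30, 35)   # e.g. facial feature frames
    lang = torch.randn(2, 12, 300)    # e.g. word embeddings
    au_intensity = model(audio, visual, lang, num_frames=30)
    print(au_intensity.shape)         # torch.Size([2, 30, 17])

Run with random tensors, the example produces an output of shape (batch, frames, 17), one intensity value per AU per frame, matching the frame-by-frame decoding described in the abstract.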