Detecting Expressions with Multimodal Transformers

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2020-11-30 DOI:10.1109/SLT48900.2021.9383573

Srinivas Parthasarathy, Shiva Sundaram


Citations: 15

Abstract

Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of a user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared with the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrate audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies show the significance of the visual modality for expression detection on the Aff-Wild2 database.
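
The abstract does not describe the architecture in detail, so the following is only a minimal sketch of the general idea of attending over fused audio-visual frames. It is not the authors' model. It assumes time-aligned audio and visual feature vectors, fuses them by concatenation, and uses identity projections in place of learned ones. All function names, feature sizes, and read-out weights are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_frames(audio, visual):
    """Concatenate per-frame audio and visual feature vectors."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams need the same number of frames")
    return [a + v for a, v in zip(audio, visual)]

def self_attention(frames):
    """Single-head scaled dot-product self-attention over time.

    Identity query/key/value projections are used for brevity (an assumption);
    a trained encoder layer would learn these projections.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        attended.append([
            sum(w * value[j] for w, value in zip(weights, frames))
            for j in range(dim)
        ])
    return attended

def regress(frames, head):
    """Per-frame linear read-out squashed to [-1, 1] with tanh."""
    return [math.tanh(sum(w * x for w, x in zip(head, frame))) for frame in frames]

# Toy inputs: 3 frames, 2 audio features and 1 visual feature per frame.
audio = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
visual = [[0.4], [0.2], [0.9]]
fused = fuse_frames(audio, visual)
context = self_attention(fused)
arousal = regress(context, head=[0.5, -0.2, 0.8])  # hypothetical, untrained weights
valence = regress(context, head=[-0.3, 0.6, 0.4])  # hypothetical, untrained weights
```

Each attended frame is a weighted average of all fused frames, so every time step can draw on both modalities across the whole clip. A recurrent baseline would instead pass information forward one step at a time.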


