{"title":"用多模态变压器检测表达式","authors":"Srinivas Parthasarathy, Shiva Sundaram","doi":"10.1109/SLT48900.2021.9383573","DOIUrl":null,"url":null,"abstract":"Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression that includes tone of the voice and facial expression serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to current state of the art. Next, we propose the transformer architecture with encoder layers that better integrate audio-visual features for expressions tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than baseline architecture with recurrent layers with absolute gains approximately 2% for arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities with gains of up to 3.6%. Ablation studies show the significance of the visual modality for the expression detection on the Aff-Wild2 database.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Detecting Expressions with Multimodal Transformers\",\"authors\":\"Srinivas Parthasarathy, Shiva Sundaram\",\"doi\":\"10.1109/SLT48900.2021.9383573\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression that includes tone of the voice and facial expression serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to current state of the art. Next, we propose the transformer architecture with encoder layers that better integrate audio-visual features for expressions tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than baseline architecture with recurrent layers with absolute gains approximately 2% for arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities with gains of up to 3.6%. Ablation studies show the significance of the visual modality for the expression detection on the Aff-Wild2 database.\",\"PeriodicalId\":243211,\"journal\":{\"name\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT48900.2021.9383573\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT48900.2021.9383573","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Detecting Expressions with Multimodal Transformers
Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person’s audio-visual expression that includes tone of the voice and facial expression serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of user’s expression. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to current state of the art. Next, we propose the transformer architecture with encoder layers that better integrate audio-visual features for expressions tracking. Performance on the Aff-Wild2 database shows that the proposed methods perform better than baseline architecture with recurrent layers with absolute gains approximately 2% for arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities with gains of up to 3.6%. Ablation studies show the significance of the visual modality for the expression detection on the Aff-Wild2 database.