Multimodal approach to engagement and disengagement detection with highly imbalanced in-the-wild data

Proceedings of the Workshop on Modeling Cognitive Processes from Multimodal Data. Pub Date: 2018-10-16. DOI: 10.1145/3279810.3279842

D. Fedotov, O. Perepelkina, E. Kazimirova, M. Konstantinova, W. Minker

Citations: 11

Abstract

Engagement/disengagement detection is a challenging task that arises in a range of human-human and human-computer interaction problems. Despite its importance and a number of studies on in-the-wild data, the problem is still far from solved. Ambiguity in the definition of engaged and disengaged states makes such data hard to collect, annotate, and analyze. In this paper we describe different approaches to building engagement/disengagement models from highly imbalanced multimodal data recorded in natural conversations. We set a baseline of 0.695 (unweighted average recall) by direct classification. We then try to detect disengagement by means of engagement regression models, since engagement and disengagement are strongly negatively correlated. To deal with the class imbalance we apply class weighting and data augmentation techniques (SMOTE and mixup). We experiment with combinations of modalities to find those that contribute most, using features from both the audio (speech) and video (face, body, lips, eyes) channels. We transform the original features with Principal Component Analysis and experiment with several types of modality fusion. Finally, we combine the approaches and raise performance to 0.715 using four modalities (all channels except face). Audio and lips features appear to contribute most, possibly because both are tightly connected with speech.
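
The evaluation metric and two of the imbalance techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not describe. The function names, the "balanced" weighting formula n / (k * count), the mixup parameter `alpha=0.2`, and the toy labels are all assumptions chosen for illustration. The sketch shows why unweighted average recall (UAR) is preferred over accuracy on imbalanced data: a classifier that always predicts the majority class reaches 90% accuracy on the toy labels but only 0.5 UAR.

```python
import random
from collections import Counter


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency (assumed heuristic)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mixup: convex combination of two samples and their numeric targets.

    lam is drawn from Beta(alpha, alpha); alpha=0.2 is an illustrative choice.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = lam * y1 + (1 - lam) * y2
    return x, y, lam


if __name__ == "__main__":
    # Toy labels: 9 "engaged" (0) samples and 1 "disengaged" (1) sample.
    y_true = [0] * 9 + [1]
    y_pred = [0] * 10  # always predicts the majority class

    print("UAR:", unweighted_average_recall(y_true, y_pred))
    print("class weights:", balanced_class_weights(y_true))

    # Blend one majority and one minority sample into a synthetic one.
    x_new, y_new, lam = mixup_pair([0.0, 2.0], 0.0, [1.0, 4.0], 1.0,
                                   rng=random.Random(0))
    print("mixup sample:", x_new, "target:", y_new, "lambda:", lam)
```

The mixup target here is a single number, which suits a regression target such as an engagement score. For classification the same blend would usually be applied to one-hot label vectors.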
