Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks

K. Otsuka, Keisuke Kasuga, Martina Köhler
{"title":"Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks","authors":"K. Otsuka, Keisuke Kasuga, Martina Köhler","doi":"10.1145/3242969.3242973","DOIUrl":null,"url":null,"abstract":"Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction , in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors including head pose, direction of the eyeball, and presence/absence of utterance. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model that separately targets each person specific to one's seat, the group model aims to jointly estimate the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., the listeners' gazes converge on the speaker. When the eyeball direction feature is available, both models outperformed the Bayes models used for comparison. In this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, it was revealed that in group models, two-stage late fusion, which integrates an individual features first, and multiparty features second, outperformed other structures. Furthermore, our experiment confirmed that image-based tracking can provide a comparable level of performance to that of sensor-based measurements. Overall, the results suggest that the CNN is a promising approach for VFoA estimation.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"18","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3242969.3242973","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 18

Abstract

Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction, in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors, including head pose, eyeball direction, and the presence or absence of utterance. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion, including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model, which is trained separately for each person at a specific seat, the group model jointly estimates the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model, especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., that listeners' gazes converge on the speaker. When the eyeball direction feature was available, both models outperformed the Bayes models used for comparison; in this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, among group models, two-stage late fusion, which integrates each participant's individual features first and multiparty features second, outperformed other structures. Furthermore, our experiment confirmed that image-based tracking provides performance comparable to that of sensor-based measurements. Overall, the results suggest that CNNs are a promising approach for VFoA estimation.
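The abstract describes the two-stage late-fusion group model only at a high level, so the following PyTorch sketch is a hypothetical illustration of that structure rather than the authors' implementation: the class name, layer sizes, the 6-channel feature layout (head pose, eyeball direction, utterance flag), the 30-frame input window, and the 4-participant setting are all assumptions made here for illustration.

```python
# Hypothetical sketch of a two-stage late-fusion group model for VFoA estimation.
# Stage 1 encodes each participant's own multimodal feature time series;
# stage 2 fuses all per-participant embeddings and predicts every
# participant's gaze target jointly. Not the authors' published code.
import torch
import torch.nn as nn


class TwoStageLateFusionVFoA(nn.Module):
    def __init__(self, n_participants=4, feat_channels=6, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.n = n_participants
        # Gaze targets per person: the other participants plus "elsewhere/averted".
        self.n_targets = n_participants
        # Stage 1: per-participant temporal CNN over that person's own features
        # (head pose, eyeball direction, utterance flag stacked as channels).
        self.individual_encoder = nn.Sequential(
            nn.Conv1d(feat_channels, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one embedding
        )
        # Stage 2: concatenate all participants' embeddings and predict jointly.
        self.group_fusion = nn.Sequential(
            nn.Linear(self.n * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, self.n * self.n_targets),
        )

    def forward(self, features):
        # features: (batch, n_participants, feat_channels, time)
        b, n, c, t = features.shape
        per_person = self.individual_encoder(features.reshape(b * n, c, t))
        per_person = per_person.reshape(b, n, -1)           # (batch, n, embed_dim)
        logits = self.group_fusion(per_person.flatten(1))   # (batch, n * n_targets)
        # One score vector over gaze targets for every participant.
        return logits.reshape(b, n, self.n_targets)


# Minimal usage example: 4 participants, 6 feature channels, 30-frame window.
if __name__ == "__main__":
    model = TwoStageLateFusionVFoA()
    x = torch.randn(8, 4, 6, 30)
    print(model(x).shape)  # torch.Size([8, 4, 4])
```

Under these assumptions, the corresponding individual model would apply only the stage-1 encoder plus a per-person classifier, and an early-fusion variant would concatenate all participants' raw features before any encoding; the two-stage structure above separates the two fusion steps, which is the design the abstract reports as performing best among group models.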