Attitude recognition of video bloggers using audio-visual descriptors

F. Haider, L. Cerrato, S. Luz, N. Campbell
{"title":"Attitude recognition of video bloggers using audio-visual descriptors","authors":"F. Haider, L. Cerrato, S. Luz, N. Campbell","doi":"10.1145/3011263.3011270","DOIUrl":null,"url":null,"abstract":"In social media, vlogs (video blogs) are a form of unidirectional communication, where the vloggers (video bloggers) convey their messages (opinions, thoughts, etc.) to a potential audience which cannot give them feedback in real time. In this kind of communication, the non-verbal behaviour and personality impression of a video blogger tends to influence viewers' attention because non-verbal cues are correlated with the messages conveyed by a vlogger. In this study, we use the acoustic and visual features (body movements that are captured by low-level visual descriptors) to predict the six different attitudes (amusement, enthusiasm, friendliness, frustration, impatience and neutral) annotated in the speech of 10 video bloggers. The automatic detection of attitude can be helpful in a scenario where a machine has to automatically provide feedback to bloggers about their performance in terms of the extent to which they manage to engage the audience by displaying certain attitudes. Attitude recognition models are trained using the random forest classifier. Results show that: 1) acoustic features provide better accuracy than the visual features, 2) while fusion of audio and visual features does not increase overall accuracy, it improves the results for some attitudes and subjects, and 3) densely extracted histograms of flow provide better results than other visual descriptors. A three-class (positive, negative and neutral attitudes) problem has also been defined. 
Results for this setting show that feature fusion degrades overall classifier accuracy, and the classifiers perform better on the original six-class problem than on the three-class setting.","PeriodicalId":272696,"journal":{"name":"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3011263.3011270","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In social media, vlogs (video blogs) are a form of unidirectional communication in which vloggers (video bloggers) convey their messages (opinions, thoughts, etc.) to a potential audience that cannot give them feedback in real time. In this kind of communication, the non-verbal behaviour and personality impression of a video blogger tend to influence viewers' attention, because non-verbal cues are correlated with the messages a vlogger conveys. In this study, we use acoustic and visual features (body movements captured by low-level visual descriptors) to predict six attitudes (amusement, enthusiasm, friendliness, frustration, impatience and neutral) annotated in the speech of 10 video bloggers. Automatic attitude detection can be helpful in a scenario where a machine must provide feedback to bloggers on how well they engage the audience by displaying certain attitudes. Attitude recognition models are trained using the random forest classifier. Results show that: 1) acoustic features provide better accuracy than visual features; 2) while fusion of audio and visual features does not increase overall accuracy, it improves the results for some attitudes and subjects; and 3) densely extracted histograms of flow provide better results than other visual descriptors. A three-class (positive, negative and neutral attitudes) problem has also been defined. Results for this setting show that feature fusion degrades overall classifier accuracy, and the classifiers perform better on the original six-class problem than on the three-class setting.
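The experimental setup described above can be sketched roughly as follows: random forest classifiers are trained separately on acoustic features, on visual features, and on their early fusion (feature concatenation), and the three configurations are compared by cross-validated accuracy. This is a minimal illustrative sketch, not the authors' code: the feature vectors here are synthetic stand-ins (the paper uses real acoustic descriptors and low-level visual descriptors such as densely extracted histograms of flow), and the data sizes and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# The six attitude labels annotated in the corpus.
ATTITUDES = ["amusement", "enthusiasm", "friendliness",
             "frustration", "impatience", "neutral"]

# Synthetic data: class-dependent means make the task learnable.
# Dimensionalities (20 acoustic, 30 visual) are arbitrary assumptions.
n = 300
y = rng.integers(0, len(ATTITUDES), size=n)
acoustic = rng.normal(loc=y[:, None], scale=2.0, size=(n, 20))
visual = rng.normal(loc=y[:, None], scale=4.0, size=(n, 30))
fused = np.hstack([acoustic, visual])  # early fusion by concatenation

def rf_accuracy(X, y):
    """Mean 5-fold cross-validated accuracy of a random forest."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5).mean()

for name, X in [("acoustic", acoustic), ("visual", visual), ("fusion", fused)]:
    print(f"{name}: {rf_accuracy(X, y):.3f}")
```

On real data the paper reports that the acoustic-only configuration outperforms the visual-only one, and that fusion helps only for some attitudes and subjects rather than overall.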