An efficient audiovisual saliency model to predict eye positions when looking at conversations

A. Coutrot, N. Guyader
{"title":"An efficient audiovisual saliency model to predict eye positions when looking at conversations","authors":"A. Coutrot, N. Guyader","doi":"10.1109/EUSIPCO.2015.7362640","DOIUrl":null,"url":null,"abstract":"Classic models of visual attention dramatically fail at predicting eye positions on visual scenes involving faces. While some recent models combine faces with low-level features, none of them consider sound as an input. Yet it is crucial in conversation or meeting scenes. In this paper, we describe and refine an audiovisual saliency model for conversation scenes. This model includes a speaker diarization algorithm which automatically modulates the saliency of conversation partners' faces and bodies according to their speaking-or-not status. To merge our different features into a master saliency map, we use an efficient statistical method (Lasso) allowing a straightforward interpretation of feature relevance. To train and evaluate our model, we run an eye tracking experiment on a publicly available meeting videobase. We show that increasing the saliency of speakers' faces (but not bodies) greatly improves the predictions of our model, compared to previous ones giving an equal and constant weight to each conversation partner.","PeriodicalId":401040,"journal":{"name":"2015 23rd European Signal Processing Conference (EUSIPCO)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 23rd European Signal Processing Conference (EUSIPCO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EUSIPCO.2015.7362640","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26

Abstract

Classic models of visual attention dramatically fail at predicting eye positions on visual scenes involving faces. While some recent models combine faces with low-level features, none of them consider sound as an input. Yet it is crucial in conversation or meeting scenes. In this paper, we describe and refine an audiovisual saliency model for conversation scenes. This model includes a speaker diarization algorithm which automatically modulates the saliency of conversation partners' faces and bodies according to their speaking-or-not status. To merge our different features into a master saliency map, we use an efficient statistical method (Lasso) allowing a straightforward interpretation of feature relevance. To train and evaluate our model, we run an eye tracking experiment on a publicly available meeting videobase. We show that increasing the saliency of speakers' faces (but not bodies) greatly improves the predictions of our model, compared to previous ones giving an equal and constant weight to each conversation partner.
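The fusion step described above lends itself to a short sketch. Below is a minimal, hypothetical Python illustration (using scikit-learn's Lasso) of how diarization-gated face maps and a low-level saliency map might be fused into a master saliency map. The feature names, gain values, and random data are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch of Lasso-based feature fusion for a master saliency map.
# All feature maps, gains, and data below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Lasso

def build_design_matrix(feature_maps):
    """Stack per-pixel feature maps (each H x W) into an (H*W, n_features) matrix."""
    return np.stack([m.ravel() for m in feature_maps], axis=1)

# Toy frame: one low-level map plus one face map per conversation partner.
H, W = 72, 96
rng = np.random.default_rng(0)
low_level = rng.random((H, W))   # e.g. contrast/motion-based saliency
face_a = rng.random((H, W))      # face map, partner A
face_b = rng.random((H, W))      # face map, partner B
a_is_speaking = True             # hypothetical speaker-diarization output for this frame

# Modulate each face map by its speaking-or-not status (illustrative gains).
speak_gain, silent_gain = 2.0, 0.5
face_a *= speak_gain if a_is_speaking else silent_gain
face_b *= silent_gain if a_is_speaking else speak_gain

X = build_design_matrix([low_level, face_a, face_b])
y = rng.random(H * W)            # stand-in for an eye-tracking fixation density map

# Lasso shrinks irrelevant feature weights to exactly zero, which is what makes
# the fitted weights directly interpretable as feature relevance.
model = Lasso(alpha=0.01, positive=True)  # non-negative weights for a saliency map
model.fit(X, y)

master_map = model.predict(X).reshape(H, W)
print("feature weights:", model.coef_)
```

In practice one would fit on many frames pooled together and read the sparsity pattern of `model.coef_` to see which features (speaker faces, silent faces, bodies, low-level saliency) actually predict fixations.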