Evaluation of optical flow field features for the detection of word prominence in a human-machine interaction scenario

Andrea Schnall, M. Heckmann
{"title":"Evaluation of optical flow field features for the detection of word prominence in a human-machine interaction scenario","authors":"Andrea Schnall, M. Heckmann","doi":"10.1109/IJCNN.2015.7280639","DOIUrl":null,"url":null,"abstract":"In this paper we investigate optical flow field features for the automatic labeling of word prominence. Visual motion is a rich source of information. Modifying the articulatory parameters to raise the prominence of a segment of an utterance, is usually accompanied by a stronger movement of mouth and head compared to a non-prominent segment. One way to describe such motion is to use optical flow fields. During the recording of the audio-visual database we used for the following experiments, the subjects were asked to make corrections for a misunderstanding of a single word of the system by using prosodic cues only, which created a narrow and a broad focus. Audio-visual recordings with a distant microphone and without visual markers were made. As acoustic features duration, loudness, fundamental frequency and spectral emphasis were calculated. From the visual channel the nose position is detected and the mouth region is extracted. From this region the optical flow is calculated and all the optical flow fields for one word are summed up. The pooled optical flow for the four directions is then used as feature vector. We demonstrate that using these features in addition to the audio features can improve the classification results for some speakers. We also compare the optical flow field features to other visual features, the nose position and image transformation based visual features. 
The optical flow field features incorporate not as much information as image transformation based visual features, but using both in addition to the audio features leads to the overall best results, which shows that they contain complementary information.","PeriodicalId":6539,"journal":{"name":"2015 International Joint Conference on Neural Networks (IJCNN)","volume":"36 1","pages":"1-7"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN.2015.7280639","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

In this paper we investigate optical flow field features for the automatic labeling of word prominence. Visual motion is a rich source of information: modifying the articulatory parameters to raise the prominence of a segment of an utterance is usually accompanied by stronger movement of the mouth and head than in a non-prominent segment. One way to describe such motion is with optical flow fields. During the recording of the audio-visual database used for the following experiments, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only, which created a narrow and a broad focus. Audio-visual recordings were made with a distant microphone and without visual markers. Duration, loudness, fundamental frequency, and spectral emphasis were calculated as acoustic features. From the visual channel, the nose position is detected and the mouth region is extracted. From this region the optical flow is calculated, and all the optical flow fields for one word are summed. The pooled optical flow for the four directions is then used as the feature vector. We demonstrate that using these features in addition to the audio features can improve the classification results for some speakers. We also compare the optical flow field features to other visual features: the nose position and image-transformation-based visual features. The optical flow field features do not incorporate as much information as the image-transformation-based features, but using both in addition to the audio features leads to the overall best results, which shows that they contain complementary information.
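The feature construction described above (sum the per-frame flow fields of a word, then pool the flow by direction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact pooling scheme, function names, and the assumption that the four directions are signed horizontal/vertical components are ours. In practice the per-frame fields would come from a dense flow estimator (e.g. OpenCV's `cv2.calcOpticalFlowFarneback`) applied to consecutive mouth-region frames; here only the summation and pooling steps are shown, in plain NumPy.

```python
import numpy as np


def pool_flow_directions(flow_sum):
    """Pool a summed optical-flow field (H x W x 2, channels [dx, dy])
    into a 4-dimensional feature vector: total motion to the right,
    left, down, and up (image y grows downward)."""
    dx = flow_sum[..., 0]
    dy = flow_sum[..., 1]
    return np.array([
        np.sum(dx[dx > 0]),    # rightward motion
        -np.sum(dx[dx < 0]),   # leftward motion
        np.sum(dy[dy > 0]),    # downward motion
        -np.sum(dy[dy < 0]),   # upward motion
    ])


def word_flow_feature(flow_fields):
    """Sum all per-frame flow fields of one word (T x H x W x 2),
    then pool by direction into the 4-dimensional feature vector."""
    return pool_flow_directions(np.sum(flow_fields, axis=0))
```

For example, a word whose frames contain only rightward and upward mouth motion yields a vector with nonzero entries in the first and fourth positions and zeros elsewhere; this vector is then concatenated with the acoustic features for classification.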