{"title":"Evaluation of optical flow field features for the detection of word prominence in a human-machine interaction scenario","authors":"Andrea Schnall, M. Heckmann","doi":"10.1109/IJCNN.2015.7280639","DOIUrl":null,"url":null,"abstract":"In this paper we investigate optical flow field features for the automatic labeling of word prominence. Visual motion is a rich source of information. Modifying the articulatory parameters to raise the prominence of a segment of an utterance, is usually accompanied by a stronger movement of mouth and head compared to a non-prominent segment. One way to describe such motion is to use optical flow fields. During the recording of the audio-visual database we used for the following experiments, the subjects were asked to make corrections for a misunderstanding of a single word of the system by using prosodic cues only, which created a narrow and a broad focus. Audio-visual recordings with a distant microphone and without visual markers were made. As acoustic features duration, loudness, fundamental frequency and spectral emphasis were calculated. From the visual channel the nose position is detected and the mouth region is extracted. From this region the optical flow is calculated and all the optical flow fields for one word are summed up. The pooled optical flow for the four directions is then used as feature vector. We demonstrate that using these features in addition to the audio features can improve the classification results for some speakers. We also compare the optical flow field features to other visual features, the nose position and image transformation based visual features. The optical flow field features incorporate not as much information as image transformation based visual features, but using both in addition to the audio features leads to the overall best results, which shows that they contain complementary information.","PeriodicalId":6539,"journal":{"name":"2015 International Joint Conference on Neural Networks (IJCNN)","volume":"36 1","pages":"1-7"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN.2015.7280639","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
In this paper we investigate optical flow field features for the automatic labeling of word prominence. Visual motion is a rich source of information: modifying the articulatory parameters to raise the prominence of a segment of an utterance is usually accompanied by stronger movements of the mouth and head than in a non-prominent segment. One way to describe such motion is with optical flow fields. During the recording of the audio-visual database used for the following experiments, the subjects were asked to correct the system's misunderstanding of a single word using prosodic cues only, which created a narrow and a broad focus condition. The audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, duration, loudness, fundamental frequency, and spectral emphasis were calculated. From the visual channel, the nose position is detected and the mouth region is extracted. From this region the optical flow is calculated, and all the optical flow fields for one word are summed; the pooled optical flow for the four directions is then used as the feature vector. We demonstrate that using these features in addition to the audio features can improve the classification results for some speakers. We also compare the optical flow field features to other visual features: the nose position and image-transformation-based visual features. The optical flow field features do not carry as much information as the image-transformation-based visual features, but using both in addition to the audio features leads to the overall best results, which shows that they contain complementary information.
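The abstract does not specify which optical flow algorithm the authors used, so the following is only a minimal sketch of the directional-pooling step it describes, assuming OpenCV's Farnebäck dense flow and an already-extracted grayscale mouth region per frame. The function name `pooled_flow_features` and all parameter values are illustrative, not taken from the paper.

```python
# Sketch: sum dense optical flow over a word's mouth-region frames and
# pool it into four directional components (+x, -x, +y, -y).
# Assumes OpenCV (cv2) and NumPy; Farneback and its parameters are
# illustrative choices, not the authors' implementation.
import cv2
import numpy as np

def pooled_flow_features(frames):
    """frames: list of same-size grayscale uint8 mouth-region images
    for one word. Returns a 4-dim vector of summed flow per direction."""
    pooled = np.zeros(4)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense flow between consecutive frames (positional args:
        # pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags).
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        pooled[0] += fx[fx > 0].sum()    # rightward motion
        pooled[1] += -fx[fx < 0].sum()   # leftward motion
        pooled[2] += fy[fy > 0].sum()    # downward motion (image y grows down)
        pooled[3] += -fy[fy < 0].sum()   # upward motion
    return pooled
```

Under this reading, the four pooled components for each word would then be concatenated with the acoustic features (duration, loudness, fundamental frequency, spectral emphasis) before classification.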