Detection of Inconsistency Between Subject and Speaker Based on the Co-occurrence of Lip Motion and Voice Towards Speech Scene Extraction from News Videos

2011 IEEE International Symposium on Multimedia. Pub Date: 2011-12-05. DOI: 10.1109/ISM.2011.56

S. Kumagai, Keisuke Doman, Tomokazu Takahashi, Daisuke Deguchi, I. Ide, H. Murase


Citations: 8

Abstract

We propose a method for detecting inconsistency between the subject and the speaker, with the aim of extracting speech scenes from news videos. Speech scenes in news videos contain a wealth of multimedia information and are valuable as archived material. One existing approach to extracting speech scenes from news videos uses the position and size of the face region. However, this approach alone is not sufficient, since news videos also contain non-speech scenes in which the speaker is not the subject, such as narrated scenes. To solve this problem, we propose a method that discriminates between speech scenes and narrated scenes based on the co-occurrence of the subject's lip motion and the speaker's voice. The proposed method uses lip shape and degree of lip opening as visual features representing the subject's lip motion, and voice volume and phoneme as audio features representing the speaker's voice. It then discriminates between speech scenes and narrated scenes based on the correlations between these features. We report the results of experiments on videos captured under laboratory conditions and on actual broadcast news videos. The results showed the effectiveness of our method and the feasibility of our research goal.
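
The co-occurrence idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single visual feature (per-frame degree of lip opening) and a single audio feature (per-frame voice volume), scores their co-occurrence with a plain Pearson correlation, and applies a hypothetical threshold of 0.5. The function names `pearson` and `classify_scene` and the threshold value are assumptions made for illustration only; the paper also uses lip shape and phoneme features, which are omitted here.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant signal gives no evidence of co-occurrence.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def classify_scene(
    lip_opening: Sequence[float],
    voice_volume: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> str:
    """Label a shot 'speech' if lip opening tracks voice volume, else 'narrated'."""
    score = pearson(lip_opening, voice_volume)
    return "speech" if score >= threshold else "narrated"
```

With per-frame sequences for one shot, `classify_scene(lip_opening, voice_volume)` returns `"speech"` when the subject's lips open and close in step with the voice and `"narrated"` otherwise. A real system would need to tune the threshold on labeled data and combine several feature pairs.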


