Audio-visual synchrony for detection of monologues in video archives

G. Iyengar, H. Nock, C. Neti

2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), July 6, 2003. DOI: 10.1109/ICASSP.2003.1200085
We present our approach to detecting monologues in video shots. A monologue shot is defined as a shot containing a talking person in the video channel with the corresponding speech in the audio channel. Whilst motivated by the TREC 2002 Video Retrieval Track (VT02), the underlying approach of measuring synchrony between the audio and video signals is also applicable to voice- and face-based biometrics, to assessing lip-synchronization quality in movie editing, and to speaker localization in video. Our approach is a two-part scheme. We first detect the occurrence of speech and of a face in a video shot. Among shots containing both speech and a face, we distinguish monologue shots as those in which the speech and facial movements are synchronized, measuring synchrony with a mutual-information-based measure. Experiments with the VT02 corpus indicate that using synchrony improves average precision by more than 50% relative to using face and speech information alone. Our synchrony-based monologue detector had the best average-precision performance among the 18 VT02 submissions.
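The abstract does not specify which audio and visual features the authors feed into the mutual-information measure, so the following is only a minimal sketch of the general idea: estimate I(X;Y) between a per-frame audio feature (here, a hypothetical "audio energy" track) and a per-frame visual feature (a hypothetical "lip motion" track) via a joint histogram, and expect synchronized streams to score higher than unrelated ones. The feature names and the plug-in histogram estimator are illustrative assumptions, not the paper's method.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) in bits from two 1-D feature tracks,
    using a joint histogram (an illustrative estimator, not the paper's)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y)
    nz = pxy > 0                           # skip zero cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Synthetic stand-ins for real features (hypothetical, for illustration only):
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
audio_energy = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(500)
lip_motion_sync = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(500)  # tracks the audio
lip_motion_rand = rng.standard_normal(500)                                # unrelated motion

mi_sync = mutual_information(audio_energy, lip_motion_sync)
mi_rand = mutual_information(audio_energy, lip_motion_rand)
# A monologue detector of this flavor would threshold (or rank by) the MI score:
# synchronized audio/visual tracks yield a noticeably higher value than unrelated ones.
```

In a real system the two tracks would come from the detected speech and face regions of the shot, and the MI score would be combined with the face and speech detectors' outputs to rank candidate monologue shots.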