Unsupervised detection of multimodal clusters in edited recordings

Alfred Dielmann
DOI: 10.1109/MMSP.2010.5662015
Published in: 2010 IEEE International Workshop on Multimedia Signal Processing
Publication date: 2010-12-10
Citations: 9

Abstract

Edited video recordings, such as talk-shows and sitcoms, often include Audio-Visual clusters: frequent repetitions of closely related acoustic and visual content. For example during a political debate, every time that a given participant holds the conversational floor, her/his voice tends to co-occur with camera views (i.e. shots) showing her/his portrait. Differently from the previous Audio-Visual clustering works, this paper proposes an unsupervised approach that detects Audio-Visual clusters, avoiding to make assumptions on the recording content, such as the presence of specific participant voices or faces. Sequences of audio and shot clusters are automatically identified using unsupervised audio diarization and shot segmentation techniques. Audio-Visual clusters are then formed by ranking the co-occurrences between these two segmentations and selecting those which significantly go beyond chance. Numerical experiments performed on a collection of 70 political debates, comprising more than 43 hours of live edited recordings, showed that automatically extracted AudioVisual clusters well match the ground-truth annotation, achieving high purity performances.
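The selection step described in the abstract, keeping only audio/shot co-occurrences that go beyond chance, can be sketched as follows. This is an illustrative approximation, not the paper's actual formulation: the function name, the frame-aligned label representation, and the use of a simple lift ratio (observed count over the count expected under independence) with a fixed threshold are all assumptions made for the sketch.

```python
from collections import Counter

def audiovisual_clusters(audio_labels, shot_labels, min_lift=2.0):
    """Rank audio/shot label co-occurrences against a chance baseline.

    audio_labels, shot_labels: frame-aligned sequences of cluster labels,
    e.g. speaker IDs from diarization and shot-cluster IDs from shot
    segmentation. Pairs whose observed co-occurrence count exceeds the
    count expected under independence by at least `min_lift` are kept
    as Audio-Visual clusters, sorted by decreasing lift.
    """
    assert len(audio_labels) == len(shot_labels)
    n = len(audio_labels)
    pair_counts = Counter(zip(audio_labels, shot_labels))
    audio_counts = Counter(audio_labels)
    shot_counts = Counter(shot_labels)

    ranked = []
    for (a, s), observed in pair_counts.items():
        # Expected co-occurrences if the two segmentations were independent.
        expected = audio_counts[a] * shot_counts[s] / n
        lift = observed / expected
        if lift >= min_lift:
            ranked.append(((a, s), lift))
    ranked.sort(key=lambda item: -item[1])
    return ranked

# Toy example: speaker 1 always co-occurs with shot cluster A,
# speaker 2 with shot cluster B.
audio = ["spk1"] * 5 + ["spk2"] * 5
shots = ["shotA"] * 5 + ["shotB"] * 5
print(audiovisual_clusters(audio, shots))
```

A chi-square or permutation test would be a more principled way to decide which co-occurrences are statistically significant; the lift threshold above is only the simplest stand-in for that idea.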