Unsupervised detection of multimodal clusters in edited recordings

Alfred Dielmann
DOI: 10.1109/MMSP.2010.5662015
Published in: 2010 IEEE International Workshop on Multimedia Signal Processing
Publication date: 2010-12-10
Citations: 9

Abstract

Edited video recordings, such as talk-shows and sitcoms, often include Audio-Visual clusters: frequent repetitions of closely related acoustic and visual content. For example during a political debate, every time that a given participant holds the conversational floor, her/his voice tends to co-occur with camera views (i.e. shots) showing her/his portrait. Differently from the previous Audio-Visual clustering works, this paper proposes an unsupervised approach that detects Audio-Visual clusters, avoiding to make assumptions on the recording content, such as the presence of specific participant voices or faces. Sequences of audio and shot clusters are automatically identified using unsupervised audio diarization and shot segmentation techniques. Audio-Visual clusters are then formed by ranking the co-occurrences between these two segmentations and selecting those which significantly go beyond chance. Numerical experiments performed on a collection of 70 political debates, comprising more than 43 hours of live edited recordings, showed that automatically extracted AudioVisual clusters well match the ground-truth annotation, achieving high purity performances.
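The selection step described in the abstract, keeping only audio/shot co-occurrences that go beyond chance, can be sketched as follows. This is an illustrative approximation, not the paper's actual formulation: the function name, the frame-aligned label representation, and the use of a simple lift ratio (observed count over the count expected under independence) with a fixed threshold are all assumptions made for the sketch.

```python
from collections import Counter

def audiovisual_clusters(audio_labels, shot_labels, min_lift=2.0):
    """Rank audio/shot label co-occurrences against a chance baseline.

    audio_labels, shot_labels: frame-aligned sequences of cluster labels,
    e.g. speaker IDs from diarization and shot-cluster IDs from shot
    segmentation. Pairs whose observed co-occurrence count exceeds the
    count expected under independence by at least `min_lift` are kept
    as Audio-Visual clusters, sorted by decreasing lift.
    """
    assert len(audio_labels) == len(shot_labels)
    n = len(audio_labels)
    pair_counts = Counter(zip(audio_labels, shot_labels))
    audio_counts = Counter(audio_labels)
    shot_counts = Counter(shot_labels)

    ranked = []
    for (a, s), observed in pair_counts.items():
        # Expected co-occurrences if the two segmentations were independent.
        expected = audio_counts[a] * shot_counts[s] / n
        lift = observed / expected
        if lift >= min_lift:
            ranked.append(((a, s), lift))
    ranked.sort(key=lambda item: -item[1])
    return ranked

# Toy example: speaker 1 always co-occurs with shot cluster A,
# speaker 2 with shot cluster B.
audio = ["spk1"] * 5 + ["spk2"] * 5
shots = ["shotA"] * 5 + ["shotB"] * 5
print(audiovisual_clusters(audio, shots))
```

A chi-square or permutation test would be a more principled way to decide which co-occurrences are statistically significant; the lift threshold above is only the simplest stand-in for that idea.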