On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval

2011 IEEE International Symposium on Multimedia. Pub Date: 2011-12-05. DOI: 10.1109/ISM.2011.79

R. Mertens, Po-Sen Huang, L. Gottlieb, G. Friedland, Ajay Divakaran


Citations: 5

Abstract


On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval

Audio concepts have recently emerged as a useful building block in multimodal video retrieval systems. Information such as "this file contains laughter", "this file contains engine sounds" or "this file contains slow music" can significantly improve purely visual retrieval. The weak point of current approaches to audio concept detection is their heavy reliance on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Instances that contain examples of those concepts are then selected, again manually, and used to train concept detectors. This approach has two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand, and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data, which introduces additional noise into the training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically, and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of the similar acoustic segments identified by the diarization system in a predefined document set, and of the theoretical fitness of this approach to discern one document class from another.
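The idea of comparing document classes by the acoustic segment groups they contain can be illustrated with a small sketch. This is not the authors' method or the ICSI system's interface: the `(start_sec, end_sec, cluster_id)` segment format, the class labels, and both function names are assumptions made only for illustration. The sketch computes, for each document, the fraction of audio time spent in each segment group, then averages those fractions per document class.

```python
from collections import defaultdict


def occupancy_histogram(segments, num_clusters):
    """Fraction of a document's audio time spent in each cluster.

    segments: list of (start_sec, end_sec, cluster_id) tuples. This is a
    hypothetical format, not the actual output of the ICSI system.
    """
    totals = [0.0] * num_clusters
    for start, end, cluster_id in segments:
        totals[cluster_id] += end - start
    duration = sum(totals)
    if duration == 0:
        return totals  # empty document: all-zero histogram
    return [t / duration for t in totals]


def class_profiles(documents, num_clusters):
    """Average the occupancy histograms of all documents in each class.

    documents: list of (class_label, segments) pairs (labels are assumed).
    """
    sums = defaultdict(lambda: [0.0] * num_clusters)
    counts = defaultdict(int)
    for label, segments in documents:
        hist = occupancy_histogram(segments, num_clusters)
        sums[label] = [s + h for s, h in zip(sums[label], hist)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }


# Toy data: two "party" documents and one "repair" document.
docs = [
    ("party", [(0.0, 6.0, 0), (6.0, 10.0, 1)]),
    ("party", [(0.0, 2.0, 0), (2.0, 10.0, 1)]),
    ("repair", [(0.0, 10.0, 2)]),
]
profiles = class_profiles(docs, num_clusters=3)
```

If the per-class profiles differ strongly, as they do in this toy data where "repair" occupies a cluster the "party" documents never use, the segment groups carry some information for telling the classes apart. Whether that holds on real data is the question the paper analyzes.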
