An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content

2013 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), pp. 114-117. Pub Date: 2013-12-09. DOI: 10.1109/ISM.2013.27

Benjamin Elizalde, Howard Lei, G. Friedland


Citations: 17

Abstract

Audio-based video event detection (VED) on user-generated content (UGC) aims to find videos that show an observable event, such as a wedding ceremony or birthday party, rather than a sound, such as music, clapping or singing. The difficulty of video content analysis on UGC lies in the acoustic variability and lack of structure of the data. The UGC task has been explored mainly by computer vision, but it can benefit from the use of audio. The i-vector system is state-of-the-art in speaker verification and outperforms a conventional Gaussian Mixture Model (GMM)-based approach. The system compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a meaningful choice for detection on UGC. This paper employs the i-vector-based system for audio-based VED on UGC and extends the understanding of how the system behaves on this task. It also includes a performance comparison with the conventional GMM-based system and the state-of-the-art Random Forest (RF)-based system. The i-vector system aids audio-based event detection by addressing UGC audio characteristics. It outperforms the GMM-based system and is competitive with the RF-based system in terms of the Missed Detection (MD) rate at 4% and 2.8% False Alarm (FA) rates. It also complements the RF-based system: combining the two yields a slight improvement over either standalone system.
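The abstract reports performance as the Missed Detection rate at fixed False Alarm rates. As a rough illustration of that kind of operating-point metric, the sketch below computes the MD rate at a target FA rate from detection scores. It is not the paper's evaluation protocol: the function name `miss_rate_at_fa`, the decision rule (a trial is detected when its score exceeds the threshold), and the example scores are all assumptions made for illustration.

```python
import numpy as np

def miss_rate_at_fa(target_scores, nontarget_scores, fa_rate):
    """Hypothetical helper: MD rate at the threshold whose FA rate is at most fa_rate.

    target_scores    -- scores of trials that do contain the event
    nontarget_scores -- scores of trials that do not contain the event
    fa_rate          -- allowed false-alarm rate, e.g. 0.04 for 4%
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)

    # Number of non-target trials allowed to exceed the threshold.
    # The small epsilon guards against floating-point round-off in the product.
    n_non = len(nontarget_scores)
    n_allowed = int(np.floor(fa_rate * n_non + 1e-9))

    # Place the threshold at the (n_allowed + 1)-th highest non-target score,
    # so that at most n_allowed non-targets score strictly above it.
    sorted_non = np.sort(nontarget_scores)[::-1]
    threshold = -np.inf if n_allowed >= n_non else sorted_non[n_allowed]

    # Assumed decision rule: detect when score > threshold.
    fa = float(np.mean(nontarget_scores > threshold))
    md = float(np.mean(target_scores <= threshold))
    return md, fa, threshold

# Synthetic scores, for illustration only.
nontarget = np.arange(10) / 10          # 0.0, 0.1, ..., 0.9
target = [0.95, 0.85, 0.75, 0.5]
md, fa, thr = miss_rate_at_fa(target, nontarget, fa_rate=0.2)
print(md, fa, thr)
```

Comparing systems at a fixed FA rate, as the abstract does, means each system gets its own threshold chosen this way, and the MD rates at those thresholds are what get compared.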

