Insights into Audio-Based Multimedia Event Classification with Neural Networks

MMCommons '15. Pub Date: 2015-10-30. DOI: 10.1145/2814815.2814816

M. Ravanelli, Benjamin Elizalde, Julia Bernd, G. Friedland


Citations: 4


Insights into Audio-Based Multimedia Event Classification with Neural Networks

Multimedia Event Detection (MED) aims to identify events (also called scenes) in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification, and discuss some insights towards more effectively using this paradigm for MED. We explore different architectures, in terms of number of layers and number of neurons. We also assess the performance impact of pre-training with Restricted Boltzmann Machines (RBMs) in contrast with random initialization, and explore the effect of varying the context window for the input to the NNs. Lastly, we compare the performance of Hidden Markov Models (HMMs) with a discriminative classifier for the event classification. We used the publicly available event-annotated YLI-MED dataset. Our results showed a performance improvement of more than 6% absolute accuracy compared to the latest results reported in the literature. Interestingly, these results were obtained with a single-layer neural network with random initialization, suggesting that standard approaches with deep learning and RBM pre-training are not fully adequate to address the high-level video event-classification task.
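The abstract mentions varying the context window for the NN input but gives no implementation details. As a rough illustration only, the sketch below shows one common way such an input can be built: each audio feature frame is concatenated with a fixed number of neighbouring frames on either side. The function name `stack_context`, the edge padding by frame repetition, and the toy feature values are all assumptions made for this example and are not taken from the paper.

```python
def stack_context(frames, context):
    """Concatenate each frame with `context` neighbours on either side.

    frames:  list of equal-length feature vectors (lists of floats).
    context: number of neighbouring frames taken on each side.
    Edges are padded by repeating the first/last frame (an assumption).
    """
    if context < 0:
        raise ValueError("context must be non-negative")
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to a valid frame
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy input: three frames with two features each (hypothetical values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = stack_context(frames, context=1)
print(len(stacked), len(stacked[0]))
```

With `context=1`, each two-dimensional frame becomes a six-dimensional input vector, so widening the window grows the NN input layer linearly while the number of training examples stays the same.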
