An Audio-Visual Dataset and Deep Learning Frameworks for Crowded Scene Classification
L. Pham, D. Ngo, Phu X. Nguyen, Hoang Van Truong, Alexander Schindler
Proceedings of the 19th International Conference on Content-based Multimedia Indexing
DOI: 10.1145/3549555.3549568
Citations: 5
Abstract
In this paper, we present the task of audio-visual scene classification (SC), in which input videos are classified into one of five real-life crowded scenes: ‘Riot’, ‘Noise-Street’, ‘Firework-Event’, ‘Music-Event’, and ‘Sport-Atmosphere’. To this end, we first collect an audio-visual dataset (videos) of these five crowded contexts from YouTube (in-the-wild scenes). We then propose a wide range of deep learning classification models, each trained independently on either the audio or the visual input data. Finally, the results obtained from the high-performing models are fused to achieve the best accuracy. Our experimental results indicate that the audio and visual inputs each contribute independently to the SC task's performance. Notably, an ensemble of deep learning models achieves a best accuracy of 95.7%.
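The abstract states that the audio and visual models are trained independently and that their results are fused, but it does not specify the fusion rule. The sketch below is a minimal illustration of one common choice, mean (late) fusion of per-class probabilities; the class names come from the paper, while the function name, array shapes, and averaging rule are assumptions for illustration only.

```python
# Hypothetical late-fusion sketch. The paper fuses results from separately
# trained audio and visual models; mean fusion of softmax probabilities is
# assumed here, not confirmed by the source.
import numpy as np

CLASSES = ["Riot", "Noise-Street", "Firework-Event",
           "Music-Event", "Sport-Atmosphere"]

def fuse_predictions(audio_probs: np.ndarray,
                     visual_probs: np.ndarray) -> np.ndarray:
    """Average per-class probabilities from the audio and visual branches.

    Both inputs have shape (num_clips, num_classes), with rows summing to 1.
    """
    return (audio_probs + visual_probs) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in softmax outputs for 4 clips over the 5 crowded-scene classes.
    audio = rng.dirichlet(np.ones(len(CLASSES)), size=4)
    visual = rng.dirichlet(np.ones(len(CLASSES)), size=4)
    fused = fuse_predictions(audio, visual)
    for row in fused:
        print(CLASSES[int(np.argmax(row))])
```

Mean fusion treats both modalities as equally reliable; a weighted average or a learned fusion layer are natural variants if one branch is consistently stronger.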