Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues

Proceedings of the 21st ACM International Conference on Multimedia. Pub Date: 2013-10-21. DOI: 10.1145/2502081.2502187

Esra Acar, F. Hopfgartner, S. Albayrak

Citations: 24

Abstract

Detecting violent scenes in movies is an important video content understanding functionality, e.g., for providing automated youth protection services. One key issue in designing algorithms for violence detection is the choice of discriminative features. In this paper, we employ mid-level audio features and compare their discriminative power against that of low-level audio and visual features. We fuse these mid-level audio cues with low-level visual ones at the decision level to further improve violence detection performance. We use Mel-Frequency Cepstral Coefficients (MFCC) as audio features and average motion as visual features. To learn a violence model, we use two-class support vector machines (SVMs). Our experimental results on detecting violent video shots in Hollywood movies show that mid-level audio features are more discriminative and provide more precise results than low-level ones. The detection performance is further enhanced by fusing the mid-level audio cues with low-level visual ones using an SVM-based decision fusion.
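The decision-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the mid-level audio representation is built, so the 2-D audio vectors below are a hypothetical stand-in, the 1-D visual vector stands in for average motion, and all values are synthetic, zero-centred toy numbers. The simple sub-gradient trainer and every name in the code (`train_linear_svm`, `make_toy_shots`, `predict_violent`) are likewise assumptions chosen for illustration.

```python
import random


def decision(w, b, x):
    """Linear SVM score for feature vector x; positive means 'violent'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=50, seed=0):
    """Fit a linear two-class SVM by sub-gradient descent on the hinge loss.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    Returns the tuple (weights, bias).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * decision(w, b, X[i])
            # L2 regularisation shrinks the weights on every step.
            w = [wj * (1.0 - eta * lam) for wj in w]
            if margin < 1.0:  # margin violated: take a hinge sub-gradient step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def make_toy_shots(n_per_class=20, seed=1):
    """Synthetic shots (assumption): violent = +1, non-violent = -1."""
    rng = random.Random(seed)
    audio, visual, labels = [], [], []
    for label in (+1, -1):
        for _ in range(n_per_class):
            centre = 2.0 * label
            # Hypothetical stand-in for a mid-level audio representation.
            audio.append([centre + rng.uniform(-0.5, 0.5) for _ in range(2)])
            # Hypothetical stand-in for a standardised average-motion value.
            visual.append([centre + rng.uniform(-0.5, 0.5)])
            labels.append(label)
    return audio, visual, labels


audio, visual, labels = make_toy_shots()

# Stage 1: one two-class SVM per modality.
audio_svm = train_linear_svm(audio, labels)
visual_svm = train_linear_svm(visual, labels)


def modality_scores(a, v):
    """Stack the per-modality SVM scores into one fusion input vector."""
    return [decision(*audio_svm, a), decision(*visual_svm, v)]


# Stage 2: a fusion SVM trained on the stage-1 scores. For brevity it reuses
# the training shots; a held-out set would be the safer choice in practice.
fusion_inputs = [modality_scores(a, v) for a, v in zip(audio, visual)]
fusion_svm = train_linear_svm(fusion_inputs, labels)


def predict_violent(a, v):
    """Return +1 (violent) or -1 (non-violent) from the fused decision."""
    return 1 if decision(*fusion_svm, modality_scores(a, v)) > 0 else -1


predictions = [predict_violent(a, v) for a, v in zip(audio, visual)]
```

The fusion classifier sees only the two per-modality scores, never the raw features. That is what distinguishes decision-level fusion from concatenating the audio and visual features before training a single classifier.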

