Violent Scene Detection Using a Super Descriptor Tensor Decomposition

Muhammad Rizwan Khokher, A. Bouzerdoum, S. L. Phung
Published in: 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA), November 2015
DOI: 10.1109/DICTA.2015.7371320
Citations: 7

Abstract

This article presents a new method for violent scene detection using super descriptor tensor decomposition. Multi-modal local features comprising auditory and visual features are extracted from Mel-frequency cepstral coefficients (including first- and second-order derivatives) and refined dense trajectories. A video sequence usually yields a large number of dense trajectories; some of these trajectories are unnecessary and can degrade accuracy. We propose to refine the dense trajectories by selecting only discriminative trajectories in the region of interest. Visual descriptors consisting of histograms of oriented gradients and motion boundary histograms are computed along the refined dense trajectories. In traditional bag-of-visual-words techniques, the feature descriptors are concatenated into a single large feature vector for classification. This destroys the spatio-temporal interactions among features extracted from multi-modal data. To address this problem, a super descriptor tensor decomposition is proposed. The extracted feature descriptors are first encoded using the super descriptor vector method. The encoded features are then arranged as tensors so as to retain the spatio-temporal structure of the features. To obtain a compact set of features for classification, the Tucker-3 decomposition is applied to the super descriptor tensors, followed by feature selection using Fisher feature ranking. The resulting features are fed to a support vector machine classifier. Experimental evaluation is performed on the violence detection benchmark dataset, MediaEval VSD2014. The proposed method outperforms most state-of-the-art methods, achieving MAP2014 scores of 60.2% and 67.8% on two subsets of the dataset.
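The two dimensionality-reduction steps named in the abstract can be illustrated concretely. Below is a minimal numpy-only sketch: Tucker-3 computed via truncated higher-order SVD (HOSVD) — one standard algorithm for this decomposition; the paper may use a different solver such as HOOI/ALS — together with a simple Fisher-ratio feature score. All function names here are illustrative, not from the paper.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker3_hosvd(T, ranks):
    """Truncated Tucker-3 via HOSVD.

    Returns a core tensor G and factor matrices (U1, U2, U3) such that
    T is approximated by G multiplied by U_n along each mode n.
    """
    factors = []
    for mode, r in enumerate(ranks):
        # Leading left singular vectors of each unfolding span the mode subspace.
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    # Core: project T onto the factor subspaces along each mode.
    G = T
    for mode, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, factors

def tucker_reconstruct(G, factors):
    """Multiply the core back by each factor matrix along its mode."""
    T = G
    for mode, U in enumerate(factors):
        T = np.moveaxis(np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T

def fisher_score(X, y):
    """Fisher ratio per feature: between-class over within-class variance.

    Higher scores indicate more discriminative features; selection keeps
    the top-ranked columns of X.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)
```

With full ranks, HOSVD reconstructs the tensor exactly; choosing smaller ranks yields the compact core used as the feature representation, whose entries are then ranked by `fisher_score` before classification.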