Spatio-temporal fisher vector coding for surveillance event detection

Qiang Chen, Yang Cai, L. Brown, A. Datta, Quanfu Fan, R. Feris, Shuicheng Yan, Alexander Hauptmann, Sharath Pankanti
{"title":"Spatio-temporal fisher vector coding for surveillance event detection","authors":"Qiang Chen, Yang Cai, L. Brown, A. Datta, Quanfu Fan, R. Feris, Shuicheng Yan, Alexander Hauptmann, Sharath Pankanti","doi":"10.1145/2502081.2502155","DOIUrl":null,"url":null,"abstract":"We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, called MoSIFT and generated by pair-wise video frames. A Gaussian Mixture Model(GMM) is learned to model the distribution of the low level features. Then for each sliding window, the Fisher vector encoding [improvedFV] is used to generate the sample representation. The model is learnt using a Linear SVM for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low level visual features as a Gaussian Mixture Model and to generate an intermediate vector representation for bag of features. FV encoding uses higher order statistics in place of histograms in the standard BoW. FV has several good properties: (a) it can naturally separate the video specific information from the noisy local features and (b) we can use a linear model for this representation. We build an efficient implementation for FV encoding which can attain a 10 times speed-up over real-time. We also take advantage of non-trivial object localization techniques to feed into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the results of all other teams submissions in TRECVID SED 2012 on four of the seven event types.","PeriodicalId":20448,"journal":{"name":"Proceedings of the 21st ACM international conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2502081.2502155","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

We present a generic event detection system evaluated in the Surveillance Event Detection (SED) task of TRECVID 2012. We investigate a statistical approach with spatio-temporal features applied to seven event classes, which were defined by the SED task. This approach is based on local spatio-temporal descriptors, called MoSIFT and generated by pair-wise video frames. A Gaussian Mixture Model(GMM) is learned to model the distribution of the low level features. Then for each sliding window, the Fisher vector encoding [improvedFV] is used to generate the sample representation. The model is learnt using a Linear SVM for each event. The main novelty of our system is the introduction of Fisher vector encoding into video event detection. Fisher vector encoding has demonstrated great success in image classification. The key idea is to model the low level visual features as a Gaussian Mixture Model and to generate an intermediate vector representation for bag of features. FV encoding uses higher order statistics in place of histograms in the standard BoW. FV has several good properties: (a) it can naturally separate the video specific information from the noisy local features and (b) we can use a linear model for this representation. We build an efficient implementation for FV encoding which can attain a 10 times speed-up over real-time. We also take advantage of non-trivial object localization techniques to feed into the video event detection, e.g. multi-scale detection and non-maximum suppression. This approach outperformed the results of all other teams submissions in TRECVID SED 2012 on four of the seven event types.
监测事件检测的时空fisher矢量编码
我们提出了一个通用的事件检测系统,在TRECVID 2012的监视事件检测(SED)任务中进行了评估。我们研究了一种具有时空特征的统计方法,应用于SED任务定义的七个事件类。该方法基于局部时空描述符,称为MoSIFT,并由成对视频帧生成。学习了高斯混合模型(GMM)来模拟低层特征的分布。然后,对于每个滑动窗口,使用Fisher矢量编码[improvedFV]来生成样本表示。对每个事件使用线性支持向量机学习模型。本系统的主要新颖之处在于将Fisher矢量编码引入视频事件检测中。费雪矢量编码在图像分类中取得了巨大的成功。关键思想是将低级视觉特征建模为高斯混合模型,并为特征包生成中间向量表示。FV编码使用高阶统计量来代替标准BoW中的直方图。FV有几个很好的特性:(a)它可以自然地将视频特定信息从嘈杂的局部特征中分离出来;(b)我们可以使用线性模型来表示这种特征。我们构建了一个有效的FV编码实现,可以实现10倍的实时加速。我们还利用非平凡的目标定位技术,如多尺度检测和非最大值抑制,来为视频事件检测提供信息。该方法在7种事件类型中的4种上优于TRECVID SED 2012中所有其他团队提交的结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信