Sabanadesan Umakanthan, S. Denman, C. Fookes, S. Sridharan
{"title":"Supervised Latent Dirichlet Allocation Models for Efficient Activity Representation","authors":"Sabanadesan Umakanthan, S. Denman, C. Fookes, S. Sridharan","doi":"10.1109/DICTA.2014.7008130","DOIUrl":null,"url":null,"abstract":"Local spatio-temporal features with a Bag-of-visual words model is a popular approach used in human action recognition. Bag-of-features methods suffer from several challenges such as extracting appropriate appearance and motion features from videos, converting extracted features appropriate for classification and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve the overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class- specific simplex LDA (css-LDA), to encode the raw features suitable for discriminative SVM based classification. Unsupervised LDA models disconnect topic discovery from the classification task, hence yield poor results compared to the baseline Bag-of-words framework. On the other hand supervised LDA techniques learn the topic structure by considering the class labels and improve the recognition accuracy significantly. MedLDA maximizes likelihood and within class margins using max-margin techniques and yields a sparse highly discriminative topic structure; while in css-LDA separate class specific topics are learned instead of common set of topics across the entire dataset. In our representation first topics are learned and then each video is represented as a topic proportion vector, i.e. it can be comparable to a histogram of topics. Finally SVM classification is done on the learned topic proportion vector. We demonstrate the efficiency of the above two representation techniques through the experiments carried out in two popular datasets. 
Experimental results demonstrate significantly improved performance compared to the baseline Bag-of-features framework which uses kmeans to construct histogram of words from the feature vectors.","PeriodicalId":146695,"journal":{"name":"2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DICTA.2014.7008130","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Local spatio-temporal features combined with a Bag-of-visual-words model are a popular approach to human action recognition. Bag-of-features methods face several challenges: extracting appropriate appearance and motion features from videos, converting the extracted features into a form suitable for classification, and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class-specific simplex LDA (css-LDA), to encode the raw features for discriminative SVM-based classification. Unsupervised LDA models disconnect topic discovery from the classification task and hence yield poor results compared to the baseline Bag-of-words framework. Supervised LDA techniques, on the other hand, learn the topic structure by considering the class labels and significantly improve recognition accuracy. MedLDA jointly maximizes the likelihood and the within-class margins using max-margin techniques, yielding a sparse, highly discriminative topic structure, while css-LDA learns separate class-specific topics instead of a common set of topics across the entire dataset. In our representation, topics are learned first and each video is then represented as a topic-proportion vector, comparable to a histogram of topics. Finally, SVM classification is performed on the learned topic-proportion vectors. We demonstrate the efficiency of these two representation techniques through experiments on two popular datasets. Experimental results show significantly improved performance compared to the baseline Bag-of-features framework, which uses k-means to construct a histogram of words from the feature vectors.
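The baseline Bag-of-features representation that the paper compares against can be sketched as follows: local spatio-temporal descriptors are clustered with k-means into a visual vocabulary, and each video is then encoded as a histogram counting how many of its descriptors fall nearest to each visual word. This is a minimal, illustrative pure-Python sketch; the function names, toy data, and parameter choices are our own and are not taken from the paper.

```python
import random


def kmeans(features, k, iters=20, seed=0):
    """Simple Lloyd's k-means to build a visual vocabulary (illustrative).

    features: list of descriptors, each a list of floats.
    Returns k cluster centers (the "visual words").
    """
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(features, k)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        clusters = [[] for _ in range(k)]
        for f in features:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(f, centers[c])))
            clusters[j].append(f)
        # Recompute each center as the mean of its assigned descriptors.
        for c in range(k):
            if clusters[c]:
                centers[c] = [sum(col) / len(col)
                              for col in zip(*clusters[c])]
    return centers


def bow_histogram(features, centers):
    """Quantize a video's descriptors into a histogram of visual words."""
    hist = [0] * len(centers)
    for f in features:
        j = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(f, centers[c])))
        hist[j] += 1
    return hist


if __name__ == "__main__":
    # Toy 2-D "descriptors" standing in for spatio-temporal features.
    descriptors = [[float(i % 5), float(i % 3)] for i in range(30)]
    vocab = kmeans(descriptors, k=4)
    print(bow_histogram(descriptors, vocab))
```

The supervised topic models in the paper replace this histogram-of-words with a topic-proportion vector of the same flavor (a fixed-length summary per video), which is then fed to the SVM in both cases.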