{"title":"Online multimodal matrix factorization for human action video indexing","authors":"F. Páez, Jorge A. Vanegas, F. González","doi":"10.1109/CBMI.2014.6849823","DOIUrl":null,"url":null,"abstract":"This paper addresses the problem of searching for videos containing instances of specific human actions. The proposed strategy builds a multimodal latent space representation where both visual content and annotations are simultaneously mapped. The hypothesis behind the method is that such a latent space yields better results when built from multiple data modalities. The semantic embedding is learned using matrix factorization through stochastic gradient descent, which makes it suitable to deal with large-scale collections. The method is evaluated on a large-scale human action video dataset with three modalities corresponding to action labels, action attributes and visual features. The evaluation is based on a query-by-example strategy, where a sample video is used as input to the system. A retrieved video is considered relevant if it contains an instance of the same human action present in the query. Experimental results show that the learned multimodal latent semantic representation produces improved performance when compared with an exclusively visual representation.","PeriodicalId":103056,"journal":{"name":"2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CBMI.2014.6849823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
This paper addresses the problem of searching for videos containing instances of specific human actions. The proposed strategy builds a multimodal latent space representation in which both visual content and annotations are simultaneously mapped. The hypothesis behind the method is that such a latent space yields better results when built from multiple data modalities. The semantic embedding is learned using matrix factorization through stochastic gradient descent, which makes it suitable for large-scale collections. The method is evaluated on a large-scale human action video dataset with three modalities corresponding to action labels, action attributes and visual features. The evaluation is based on a query-by-example strategy, where a sample video is used as input to the system. A retrieved video is considered relevant if it contains an instance of the same human action present in the query. Experimental results show that the learned multimodal latent semantic representation produces improved performance when compared with an exclusively visual representation.
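To make the idea concrete, below is a minimal sketch (not the authors' code) of multimodal matrix factorization trained with stochastic gradient descent: each modality matrix X_m is approximated by a shared latent code matrix H times a per-modality basis W_m, and query-by-example retrieval is done by embedding a visual-only query into the latent space and ranking by cosine similarity. All dimensions, learning rates, and the toy random data are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch: shared latent space H learned so that each modality
# X_m is approximated by H @ W_m, with per-sample stochastic gradient descent
# on the squared reconstruction error (assumed objective, not the paper's exact one).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "videos" with three modalities (visual features, action labels, attributes).
n, k = 200, 10                                          # samples, latent dimension
dims = {"visual": 64, "labels": 12, "attributes": 20}   # assumed modality dimensions
X = {m: rng.standard_normal((n, d)) for m, d in dims.items()}

H = 0.01 * rng.standard_normal((n, k))                          # latent codes, one row per video
W = {m: 0.01 * rng.standard_normal((k, d)) for m, d in dims.items()}  # per-modality bases

lr, epochs = 1e-3, 50
for _ in range(epochs):
    for i in rng.permutation(n):                        # stochastic (per-sample) updates
        for m, Xm in X.items():
            err = H[i] @ W[m] - Xm[i]                   # reconstruction residual for sample i
            grad_h = err @ W[m].T                       # gradient w.r.t. latent code
            grad_w = np.outer(H[i], err)                # gradient w.r.t. modality basis
            H[i] -= lr * grad_h
            W[m] -= lr * grad_w

# Query by example: embed a visual-only query via least squares against W["visual"],
# then rank indexed videos by cosine similarity in the shared latent space.
def embed_visual(x_visual):
    h, *_ = np.linalg.lstsq(W["visual"].T, x_visual, rcond=None)
    return h

query = X["visual"][0]
h_q = embed_visual(query)
scores = (H @ h_q) / (np.linalg.norm(H, axis=1) * np.linalg.norm(h_q) + 1e-12)
print("top-5 matches:", np.argsort(-scores)[:5])
```

Because the updates touch one sample at a time, the same loop can be run online over a streaming collection, which is the property the abstract highlights for large-scale indexing; the least-squares embedding of visual-only queries is one simple choice and is an assumption here.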