{"title":"视频事件检索的属性包","authors":"Leonardo A. Duarte, O. A. B. Penatti, J. Almeida","doi":"10.1109/SIBGRAPI.2018.00064","DOIUrl":null,"url":null,"abstract":"In this paper, we present the Bag-of-Attributes (BoA) model for video representation aiming at video event retrieval. The BoA model is based on a semantic feature space for representing videos, resulting in high-level video feature vectors. For creating a semantic space, i.e., the attribute space, we can train a classifier using a labeled image dataset, obtaining a classification model that can be understood as a high-level codebook. This model is used to map low-level frame vectors into high-level vectors (e.g., classifier probability scores). Then, we apply pooling operations to the frame vectors to create the final bag of attributes for the video. In the BoA representation, each dimension corresponds to one category (or attribute) of the semantic space. Other interesting properties are: compactness, flexibility regarding the classifier, and ability to encode multiple semantic concepts in a single video representation. Our experiments considered the semantic space created by state-of-the-art convolutional neural networks pre-trained on 1000 object categories of ImageNet. Such deep neural networks were used to classify each video frame and then different coding strategies were used to encode the probability distribution from the softmax layer into a frame vector. Next, different pooling strategies were used to combine frame vectors in the BoA representation for a video. Results using BoA were comparable or superior to the baselines in the task of video event retrieval using the EVVE dataset, with the advantage of providing a much more compact representation.","PeriodicalId":208985,"journal":{"name":"2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Bag of Attributes for Video Event Retrieval\",\"authors\":\"Leonardo A. Duarte, O. A. B. Penatti, J. Almeida\",\"doi\":\"10.1109/SIBGRAPI.2018.00064\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we present the Bag-of-Attributes (BoA) model for video representation aiming at video event retrieval. The BoA model is based on a semantic feature space for representing videos, resulting in high-level video feature vectors. For creating a semantic space, i.e., the attribute space, we can train a classifier using a labeled image dataset, obtaining a classification model that can be understood as a high-level codebook. This model is used to map low-level frame vectors into high-level vectors (e.g., classifier probability scores). Then, we apply pooling operations to the frame vectors to create the final bag of attributes for the video. In the BoA representation, each dimension corresponds to one category (or attribute) of the semantic space. Other interesting properties are: compactness, flexibility regarding the classifier, and ability to encode multiple semantic concepts in a single video representation. Our experiments considered the semantic space created by state-of-the-art convolutional neural networks pre-trained on 1000 object categories of ImageNet. 
Such deep neural networks were used to classify each video frame and then different coding strategies were used to encode the probability distribution from the softmax layer into a frame vector. Next, different pooling strategies were used to combine frame vectors in the BoA representation for a video. Results using BoA were comparable or superior to the baselines in the task of video event retrieval using the EVVE dataset, with the advantage of providing a much more compact representation.\",\"PeriodicalId\":208985,\"journal\":{\"name\":\"2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)\",\"volume\":\"68 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SIBGRAPI.2018.00064\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SIBGRAPI.2018.00064","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In this paper, we present the Bag-of-Attributes (BoA) model for video representation, aimed at video event retrieval. The BoA model is based on a semantic feature space for representing videos, resulting in high-level video feature vectors. To create the semantic space, i.e., the attribute space, we can train a classifier on a labeled image dataset, obtaining a classification model that can be understood as a high-level codebook. This model is used to map low-level frame vectors into high-level vectors (e.g., classifier probability scores). Then, we apply pooling operations to the frame vectors to create the final bag of attributes for the video. In the BoA representation, each dimension corresponds to one category (or attribute) of the semantic space. Other interesting properties are: compactness, flexibility regarding the classifier, and the ability to encode multiple semantic concepts in a single video representation. Our experiments considered the semantic space created by state-of-the-art convolutional neural networks pre-trained on the 1000 object categories of ImageNet. Such deep neural networks were used to classify each video frame, and different coding strategies were then used to encode the probability distribution from the softmax layer into a frame vector. Next, different pooling strategies were used to combine the frame vectors into the BoA representation of a video. Results using BoA were comparable to or better than the baselines on the task of video event retrieval over the EVVE dataset, with the advantage of providing a much more compact representation.
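The following is a minimal sketch of the BoA pipeline described in the abstract, assuming PyTorch/torchvision and a ResNet-50 pre-trained on the 1000 ImageNet classes as the frame-level attribute classifier; the paper evaluates its own choice of CNNs and coding/pooling strategies, so this only illustrates the basic idea (softmax per frame, then pooling over frames).

```python
# Sketch of a Bag-of-Attributes (BoA) video representation.
# Assumption: torchvision's ResNet-50 stands in for the ImageNet-trained
# classifier used as the "high-level codebook"; frames are PIL images.
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Frame-level attribute extractor: an ImageNet classifier defines the semantic space.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def frame_attributes(frames):
    """Map each frame (PIL image) to a 1000-d softmax vector over ImageNet classes."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        logits = model(batch)            # shape: (num_frames, 1000)
    return F.softmax(logits, dim=1)      # one attribute (probability) vector per frame

def bag_of_attributes(frame_vectors, pooling="avg"):
    """Pool frame-level attribute vectors into a single video-level BoA vector."""
    if pooling == "avg":
        return frame_vectors.mean(dim=0)          # average pooling over frames
    if pooling == "max":
        return frame_vectors.max(dim=0).values    # max pooling over frames
    raise ValueError(f"unknown pooling: {pooling}")

# Usage (hypothetical): given `frames`, a list of decoded video frames,
#   boa = bag_of_attributes(frame_attributes(frames), pooling="avg")
# Videos can then be ranked for event retrieval by, e.g., cosine similarity
# between their 1000-d BoA vectors.
```

Each dimension of the resulting vector corresponds to one ImageNet category (attribute), which is what makes the representation both compact and interpretable at the semantic level.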