Towards a Knowledge-Based Approach for Generating Video Descriptions

Sathyanarayanan N. Aakur, F. Souza, Sudeep Sarkar
{"title":"Towards a Knowledge-Based Approach for Generating Video Descriptions","authors":"Sathyanarayanan N. Aakur, F. Souza, Sudeep Sarkar","doi":"10.1109/CRV.2017.51","DOIUrl":null,"url":null,"abstract":"Existent video description approaches advocated in the literature rely on capturing the semantic relationships among concepts and visual features from training data specific to various datasets. Naturally, their success at generalizing the video descriptions for the domain is closely dependent on the availability, representativeness, size and annotation quality of the training data. Common issues are overfitting, the amount of training data and computational time required for the model. To overcome these issues, we propose to alleviate the learning of semantic knowledge from domain-specific datasets by leveraging general human knowledge sources such as ConceptNet. We propose the use of ConceptNet as the source of knowledge for generating video descriptions using Grenander's pattern theory formalism. Instead of relying on training data to estimate semantic compatibility of two concepts, we use weights in the ConceptNet that determines the degree of validity of the assertion between two concepts based on the knowledge sources. We test and compare this idea on the task of generating semantically coherent descriptions for videos from the Breakfast Actions and Carnegie Mellon's Multimodal activities dataset. In comparison with other approaches, the proposed method achieves comparable accuracy against state-of-the-art methods based on HMMs and CFGs and generate semantically coherent descriptions even when presented with inconsistent action and object labels. We are also able to show that the proposed approach performs comparably with models trained on domain-specific data.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 14th Conference on Computer and Robot Vision (CRV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CRV.2017.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Existing video description approaches in the literature rely on capturing the semantic relationships among concepts and visual features from training data specific to various datasets. Naturally, their success at generalizing video descriptions for a domain depends closely on the availability, representativeness, size, and annotation quality of the training data. Common issues are overfitting, the amount of training data required, and the computational time needed to train the model. To overcome these issues, we propose to alleviate the learning of semantic knowledge from domain-specific datasets by leveraging general human knowledge sources such as ConceptNet. We propose the use of ConceptNet as the source of knowledge for generating video descriptions using Grenander's pattern theory formalism. Instead of relying on training data to estimate the semantic compatibility of two concepts, we use the weights in ConceptNet, which quantify the degree of validity of an assertion between two concepts based on its underlying knowledge sources. We test and compare this idea on the task of generating semantically coherent descriptions for videos from the Breakfast Actions dataset and Carnegie Mellon's Multimodal Activities dataset. The proposed method achieves accuracy comparable to state-of-the-art methods based on HMMs and CFGs, and it generates semantically coherent descriptions even when presented with inconsistent action and object labels. We also show that the proposed approach performs comparably with models trained on domain-specific data.
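The key substitution is that, where earlier systems estimate the compatibility of two concepts (e.g., an action and an object) from co-occurrence statistics in annotated training data, this approach reads it off the weighted assertions in ConceptNet. The sketch below illustrates one way such a compatibility score could be looked up using the public ConceptNet 5 web API; the scoring function and the example concept pairs are illustrative assumptions, not the paper's exact pattern-theory formulation.

```python
import requests

CONCEPTNET_QUERY_URL = "https://api.conceptnet.io/query"


def conceptnet_compatibility(concept_a: str, concept_b: str, lang: str = "en") -> float:
    """Illustrative compatibility score for two concepts: the sum of the
    weights of all ConceptNet assertions linking them (0.0 if none exist).
    A simplified stand-in for the knowledge-based support used in the paper.
    """
    params = {
        "node": f"/c/{lang}/{concept_a}",
        "other": f"/c/{lang}/{concept_b}",
        "limit": 50,
    }
    response = requests.get(CONCEPTNET_QUERY_URL, params=params, timeout=10)
    response.raise_for_status()
    edges = response.json().get("edges", [])
    return sum(edge.get("weight", 0.0) for edge in edges)


if __name__ == "__main__":
    # Hypothetical example: a plausible action-object pairing from a kitchen
    # video should receive more ConceptNet support than an implausible one.
    print("pour / milk   :", conceptnet_compatibility("pour", "milk"))
    print("pour / hammer :", conceptnet_compatibility("pour", "hammer"))
```

In the pattern theory formulation, scores of this kind would presumably act as bond support between generators (action and object labels), so that configurations of labels better supported by ConceptNet yield more plausible descriptions without any domain-specific training of the compatibility model.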