{"title":"Towards a Knowledge-Based Approach for Generating Video Descriptions","authors":"Sathyanarayanan N. Aakur, F. Souza, Sudeep Sarkar","doi":"10.1109/CRV.2017.51","DOIUrl":null,"url":null,"abstract":"Existent video description approaches advocated in the literature rely on capturing the semantic relationships among concepts and visual features from training data specific to various datasets. Naturally, their success at generalizing the video descriptions for the domain is closely dependent on the availability, representativeness, size and annotation quality of the training data. Common issues are overfitting, the amount of training data and computational time required for the model. To overcome these issues, we propose to alleviate the learning of semantic knowledge from domain-specific datasets by leveraging general human knowledge sources such as ConceptNet. We propose the use of ConceptNet as the source of knowledge for generating video descriptions using Grenander's pattern theory formalism. Instead of relying on training data to estimate semantic compatibility of two concepts, we use weights in the ConceptNet that determines the degree of validity of the assertion between two concepts based on the knowledge sources. We test and compare this idea on the task of generating semantically coherent descriptions for videos from the Breakfast Actions and Carnegie Mellon's Multimodal activities dataset. In comparison with other approaches, the proposed method achieves comparable accuracy against state-of-the-art methods based on HMMs and CFGs and generate semantically coherent descriptions even when presented with inconsistent action and object labels. We are also able to show that the proposed approach performs comparably with models trained on domain-specific data.","PeriodicalId":308760,"journal":{"name":"2017 14th Conference on Computer and Robot Vision (CRV)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 14th Conference on Computer and Robot Vision (CRV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CRV.2017.51","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7
Abstract
Existing video description approaches advocated in the literature rely on capturing the semantic relationships among concepts and visual features from training data specific to various datasets. Naturally, their success at generalizing video descriptions to a domain depends closely on the availability, representativeness, size, and annotation quality of the training data. Common issues include overfitting, the amount of training data required, and the computational time needed to train the model. To overcome these issues, we propose to alleviate the learning of semantic knowledge from domain-specific datasets by leveraging general human knowledge sources such as ConceptNet. We propose the use of ConceptNet as the source of knowledge for generating video descriptions using Grenander's pattern theory formalism. Instead of relying on training data to estimate the semantic compatibility of two concepts, we use the weights in ConceptNet, which quantify the degree of validity of an assertion between two concepts based on the underlying knowledge sources. We test and compare this idea on the task of generating semantically coherent descriptions for videos from the Breakfast Actions dataset and Carnegie Mellon's Multimodal Activities dataset. In comparison with other approaches, the proposed method achieves accuracy comparable to state-of-the-art methods based on HMMs and CFGs, and it generates semantically coherent descriptions even when presented with inconsistent action and object labels. We also show that the proposed approach performs comparably with models trained on domain-specific data.
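As a minimal illustrative sketch (not the paper's implementation), the snippet below shows one way ConceptNet edge weights could be queried and aggregated as a rough proxy for the semantic compatibility of two concepts. It assumes the public ConceptNet 5 REST API at api.conceptnet.io; the endpoint, query parameters, field names, and the simple weight-summing aggregation are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch only: query the public ConceptNet 5 REST API
# (api.conceptnet.io, assumed reachable) and use the returned edge weights
# as a rough proxy for the semantic compatibility of two concepts, in the
# spirit of the ConceptNet-weighted bonds described in the abstract.
import requests


def conceptnet_compatibility(concept_a: str, concept_b: str, lang: str = "en") -> float:
    """Sum the weights of all ConceptNet edges connecting two concepts."""
    url = "http://api.conceptnet.io/query"
    params = {
        "node": f"/c/{lang}/{concept_a}",
        "other": f"/c/{lang}/{concept_b}",
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    edges = response.json().get("edges", [])
    # Each edge carries a 'weight' expressing how strongly ConceptNet's
    # knowledge sources support the assertion between the two concepts.
    return sum(edge.get("weight", 0.0) for edge in edges)


if __name__ == "__main__":
    # Example: how compatible are "knife" and "cut"?  A higher score suggests
    # a stronger assertion between the two concepts in ConceptNet.
    print(conceptnet_compatibility("knife", "cut"))
```

In this toy setup, such a score would stand in for the training-data-derived compatibility used by data-driven models; the paper instead folds these weights into Grenander's pattern theory formalism rather than summing them directly.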