C3VQG: category consistent cyclic visual question generation
Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, R. Shah
Proceedings of the 2nd ACM International Conference on Multimedia in Asia (2020)
DOI: 10.1145/3444685.3446302 (https://doi.org/10.1145/3444685.3446302)
Citations: 14
Abstract
Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which generate meaningful questions given an image and its associated ground-truth answer. VQG becomes more challenging when the image contains rich contextual information spanning different semantic categories. In this paper, we exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder (VAE) without ground-truth answers. Our approach addresses two major shortcomings of existing VQG systems: (i) it minimizes the level of supervision required and (ii) it replaces generic questions with category-relevant generations. Most importantly, eliminating expensive answer annotations weakens the required supervision. Using different categories enables us to exploit different concepts, since inference requires only the image and the category. Mutual information is maximized between the image, question, and answer category in the latent space of our VAE. A novel category-consistent cyclic loss is proposed to enable the model to generate consistent predictions with respect to the answer category, reducing redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide category-based structure and to enhance generalization by encapsulating decorrelated features within each dimension. Through extensive experiments, the proposed model, C3VQG, outperforms state-of-the-art VQG methods under weak supervision.
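The sketch below is a minimal PyTorch illustration (not the authors' code) of the kind of objective the abstract describes: a VAE conditioned on image features and an answer category, trained with a reconstruction term, a KL term on the latent space, and a category-consistent cyclic term that asks the category inferred from the generated question to match the category it was conditioned on. All module names, dimensions, and the simplified decoder are illustrative assumptions.

```python
# Hedged sketch of a category-conditioned VAE for question generation with a
# cyclic category-consistency loss. Architecture details are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, IMG_DIM, VOCAB, N_CAT, MAX_LEN = 128, 512, 10000, 16, 20


class C3VQGSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: image features + category embedding -> latent Gaussian
        self.cat_emb = nn.Embedding(N_CAT, 64)
        self.enc = nn.Linear(IMG_DIM + 64, 2 * LATENT)
        # Decoder: latent -> question token logits (flattened for brevity;
        # the paper's decoder would be a sequence model)
        self.dec = nn.Linear(LATENT, MAX_LEN * VOCAB)
        # Category head used for the cyclic consistency term
        self.cat_head = nn.Linear(MAX_LEN * VOCAB, N_CAT)

    def forward(self, img_feat, category):
        h = torch.cat([img_feat, self.cat_emb(category)], dim=-1)
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        logits = self.dec(z)                # question token logits
        cat_logits = self.cat_head(logits)  # category cycled back from the question
        return logits, cat_logits, mu, logvar


def loss_fn(logits, cat_logits, mu, logvar, target_tokens, category):
    # Question reconstruction: token-level cross-entropy
    rec = F.cross_entropy(logits.view(-1, VOCAB), target_tokens.view(-1))
    # Standard VAE KL term regularizing the latent space
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Category-consistent cyclic term: the generated question should imply
    # the answer category it was conditioned on
    cyc = F.cross_entropy(cat_logits, category)
    return rec + kl + cyc
```

At inference time only an image and a target answer category are needed: one samples z from the prior (or the encoder's posterior without an answer) and decodes a question, which is what allows the method to drop ground-truth answer annotations.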