Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification

Mengyi Liu, Zhu Liu
{"title":"用于多模态视频分类的深度强化学习视觉-文本注意","authors":"Mengyi Liu, Zhu Liu","doi":"10.1145/3347450.3357654","DOIUrl":null,"url":null,"abstract":"Nowadays multimedia contents including text, images, and videos have been produced and shared ubiquitously in our daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. The trend of web data becoming increasingly multimodal makes the task of multimodal classification ever more popular and pertinent. In this paper, we mainly focus on the scenario of videos for their intrinsic multimodal property, and resort to attention learning among different modalities for classification. Specifically, we formulate the multimodal attention learning as a sequential decision-making process, and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward which considers the multi-label classification loss, and two unsupervised rewards which simultaneously consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art methods of attention mechanism and multimodal fusion for video classification task.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification\",\"authors\":\"Mengyi Liu, Zhu Liu\",\"doi\":\"10.1145/3347450.3357654\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays multimedia contents including text, images, and videos have been produced and shared ubiquitously in our daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. The trend of web data becoming increasingly multimodal makes the task of multimodal classification ever more popular and pertinent. In this paper, we mainly focus on the scenario of videos for their intrinsic multimodal property, and resort to attention learning among different modalities for classification. Specifically, we formulate the multimodal attention learning as a sequential decision-making process, and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward which considers the multi-label classification loss, and two unsupervised rewards which simultaneously consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. 
Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art methods of attention mechanism and multimodal fusion for video classification task.\",\"PeriodicalId\":329495,\"journal\":{\"name\":\"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3347450.3357654\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3347450.3357654","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Multimedia content including text, images, and videos is now produced and shared ubiquitously in daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. The trend of web data becoming increasingly multimodal makes the task of multimodal classification ever more popular and pertinent. In this paper, we focus on videos because of their intrinsically multimodal nature, and resort to attention learning among different modalities for classification. Specifically, we formulate multimodal attention learning as a sequential decision-making process and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward that considers the multi-label classification loss, and two unsupervised rewards that consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of the policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art attention-mechanism and multimodal-fusion methods for the video classification task.
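
To make the setup concrete, the following is a minimal, hypothetical PyTorch sketch of the kind of pipeline the abstract describes: a recurrent policy network that picks either the visual or the text modality at each time step, aggregates the chosen features for multi-label classification, and is updated with a REINFORCE-style objective driven by the supervised reward. All module names, dimensions, and the loss-to-reward mapping are assumptions, the two unsupervised rewards (inter-modality consistency and intra-modality reconstruction) are omitted for brevity, and the authors' actual implementation may differ.

```python
# Hypothetical sketch of per-time-step modality selection with a policy network.
# Not the authors' code; shapes, names, and the reward definition are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySelectionPolicy(nn.Module):
    """GRU-based policy that emits a visual-vs-text choice at each time step."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=25):
        super().__init__()
        self.rnn = nn.GRU(feat_dim * 2, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, 2)          # 2 actions: visual / text
        self.classifier = nn.Linear(feat_dim, num_classes)   # multi-label head

    def forward(self, visual, text):
        # visual, text: (batch, T, feat_dim)
        state, _ = self.rnn(torch.cat([visual, text], dim=-1))
        logits = self.policy_head(state)                      # (batch, T, 2)
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()                               # 0 = visual, 1 = text
        chosen = torch.where(actions.unsqueeze(-1) == 0, visual, text)
        pooled = chosen.mean(dim=1)                           # aggregate selected features
        return self.classifier(pooled), dist.log_prob(actions)


def supervised_reward(class_logits, labels):
    # Assumed mapping: higher reward for lower multi-label classification loss.
    bce = F.binary_cross_entropy_with_logits(class_logits, labels, reduction="none")
    return -bce.mean(dim=-1)                                  # (batch,)


# Toy usage: one REINFORCE-style update on random data.
policy = ModalitySelectionPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

visual = torch.randn(4, 10, 512)        # 4 videos, 10 time steps
text = torch.randn(4, 10, 512)
labels = torch.randint(0, 2, (4, 25)).float()

class_logits, log_probs = policy(visual, text)
reward = supervised_reward(class_logits, labels).detach()     # treat reward as a constant
policy_loss = -(log_probs.sum(dim=1) * reward).mean()         # REINFORCE objective
cls_loss = F.binary_cross_entropy_with_logits(class_logits, labels)

(policy_loss + cls_loss).backward()
optimizer.step()
```

In this sketch the classification loss trains the feature-aggregation path directly, while the sampled modality choices, which are not differentiable, are trained through the policy-gradient term weighted by the reward; the paper's unsupervised consistency and reconstruction rewards would be added to that same reward signal.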