Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation

Proceedings of the 30th ACM International Conference on Multimedia. Pub Date: 2022-10-10. DOI: 10.1145/3503161.3548283

Siying Wu, Xueyang Fu, Feng Wu, Zhengjun Zha

Citations: 5

Abstract

Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) requires an agent to reach a target location by progressively grounding and following an instruction, conditioned on its memory and current observation. Existing works use cross-modal transformers to pass messages between the visual and textual modalities. However, they remain limited in mining the fine-grained matching between the underlying components of trajectories and instructions. Inspired by the significant progress of large-scale pre-training methods, we propose CSAP, a Cross-modal Semantic Alignment Pre-training method for Vision-and-Language Navigation. CSAP learns alignment from trajectory-instruction pairs through two novel tasks: trajectory-conditioned masked fragment modeling and contrastive semantic-alignment modeling. Trajectory-conditioned masked fragment modeling encourages the agent to extract useful visual information in order to reconstruct the masked fragment. Contrastive semantic-alignment modeling aligns visual representations with the corresponding phrase embeddings. Experiments on the benchmark dataset show that a transformer-based navigation agent pre-trained with CSAP outperforms existing methods on both success rate (SR) and success weighted by path length (SPL).
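The abstract does not state the loss used for contrastive semantic-alignment modeling. As a rough illustration of what aligning visual representations with phrase embeddings can look like, the sketch below assumes a generic symmetric InfoNCE objective over cosine similarities. The function names, the temperature value, and the pairing convention (`visual[i]` matches `phrases[i]`) are all assumptions made for illustration, not details taken from CSAP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity_matrix(visual, phrases):
    """Return sim[i][j] = cosine similarity of visual[i] and phrases[j]."""
    v = [l2_normalize(row) for row in visual]
    p = [l2_normalize(row) for row in phrases]
    return [[sum(a * b for a, b in zip(vi, pj)) for pj in p] for vi in v]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def info_nce_loss(visual, phrases, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed stand-in, not the paper's stated loss).

    visual[i] and phrases[i] are treated as the positive pair; every other
    combination in the batch acts as a negative.
    """
    if not visual or len(visual) != len(phrases):
        raise ValueError("need equally sized, non-empty batches")
    sim = cosine_similarity_matrix(visual, phrases)
    n = len(sim)
    # Visual-to-phrase direction: each row of sim is one classification problem.
    v2p = sum(
        cross_entropy([s / temperature for s in sim[i]], i) for i in range(n)
    ) / n
    # Phrase-to-visual direction: each column of sim is one classification problem.
    p2v = sum(
        cross_entropy([sim[i][j] / temperature for i in range(n)], j)
        for j in range(n)
    ) / n
    return 0.5 * (v2p + p2v)


# Toy data: two visual features and two phrase embeddings (hypothetical values).
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_phrases = [[0.9, 0.1], [0.1, 0.9]]
shuffled_phrases = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce_loss(visual, aligned_phrases)
shuffled_loss = info_nce_loss(visual, shuffled_phrases)
```

In this toy setup the loss is small when each visual feature is closest to its own phrase and large when the pairing is shuffled, which is the behavior a contrastive alignment objective is meant to reward.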

Source journal

Proceedings of the 30th ACM International Conference on Multimedia

Self-citation rate

0.00%

Publication count