{"title":"Vision-and-Language Navigation via Latent Semantic Alignment Learning","authors":"Siying Wu;Xueyang Fu;Feng Wu;Zheng-Jun Zha","doi":"10.1109/TMM.2024.3358112","DOIUrl":null,"url":null,"abstract":"Vision-and-Language Navigation (VLN) requires that an agent can comprehensively understand the given instructions and the immediate visual information obtained from the environment, so as to make correct actions to achieve the navigation goal. Therefore, semantic alignment across modalities is crucial for the agent understanding its own state during the navigation process. However, the potential of semantic alignment has not been systematically explored in current studies, which limits the further improvement of navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to develop the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks are used to reason about cross-modal dependencies, while the third one is able to learn semantically consistent representations across modalities. In this way, the Latent Semantic Alignment Learning method establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of our proposed methods. For example, we improve the Success Rate by 1.6% on the R2R validation unseen set and 4.3% on the R4R validation unseen set over the baseline model.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"8406-8418"},"PeriodicalIF":9.7000,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10414007/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
Vision-and-Language Navigation (VLN) requires an agent to comprehensively understand both the given instructions and the immediate visual observations from the environment, so that it can take correct actions to reach the navigation goal. Semantic alignment across modalities is therefore crucial for the agent to understand its own state during navigation. However, the potential of semantic alignment has not been systematically explored in existing studies, which limits further improvement in navigation performance. To address this issue, we propose a new Latent Semantic Alignment Learning method to exploit the semantically aligned relationships contained in the environment. Specifically, we introduce three novel pre-training tasks: Trajectory-conditioned Masked Fragment Modeling, Action Prediction of Masked Observation, and Hierarchical Triple Contrastive Learning. The first two tasks reason about cross-modal dependencies, while the third learns semantically consistent representations across modalities. In this way, Latent Semantic Alignment Learning establishes a consistent perception of the environment and makes the agent's actions easier to explain. Experiments on common benchmarks verify the effectiveness of the proposed method; for example, we improve the Success Rate by 1.6% on the R2R validation unseen set and by 4.3% on the R4R validation unseen set over the baseline model.
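The abstract does not give the exact form of the pre-training objectives, so the sketch below is only an illustrative guess at what a "triple contrastive" alignment term could look like: a symmetric InfoNCE loss applied pairwise over instruction, trajectory, and observation embeddings at a single granularity. The function names, tensor shapes, and temperature value are hypothetical and are not taken from the paper; the hierarchical variant would presumably repeat such a term at additional levels of granularity.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of shape [B, D].

    Matched rows (same batch index) are treated as positive pairs; all other
    rows in the batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def triple_contrastive_loss(instr: torch.Tensor,
                            traj: torch.Tensor,
                            obs: torch.Tensor) -> torch.Tensor:
    """Average pairwise InfoNCE over the three modality pairs
    (instruction-trajectory, instruction-observation, trajectory-observation)."""
    return (info_nce(instr, traj) +
            info_nce(instr, obs) +
            info_nce(traj, obs)) / 3.0


if __name__ == "__main__":
    # Toy usage: 8 samples with 256-dimensional pooled embeddings per modality.
    B, D = 8, 256
    instr, traj, obs = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
    print(triple_contrastive_loss(instr, traj, obs).item())
```

Under this reading, pulling the three modality embeddings of the same episode together (and pushing apart mismatched ones) is what would yield the "semantically consistent representations across modalities" that the third task is said to learn.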
About the Journal
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.