Language, Vision and Action are Better Together

Companion Proceedings of the Web Conference 2021 | Pub Date: 2021-04-19 | DOI: 10.1145/3442442.3451897

Jason Baldridge

Citation count: 0

Abstract

Human knowledge and use of language are inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language, including speech, in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase "how babies learn" in the talk, I'll cover three of my team's efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation, a multi-stage model that exploits user guidance in the form of mouse traces and a single-stage one that uses cross-modal contrastive losses. Third: Room-across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas.

I'll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.
