{"title":"Language, Vision and Action are Better Together","authors":"Jason Baldridge","doi":"10.1145/3442442.3451897","DOIUrl":null,"url":null,"abstract":"Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language-including speech-in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase ”how babies learn” in the talk, I’ll cover three of my team’s efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation: a multi-stage model which exploits user-guidance in the form of mouse traces and a single-stage one which uses cross-modal contrastive losses. Third: Room-across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas. I’ll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.","PeriodicalId":129420,"journal":{"name":"Companion Proceedings of the Web Conference 2021","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Proceedings of the Web Conference 2021","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3442442.3451897","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Human knowledge and use of language are inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language, including speech, in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase "how babies learn" in the talk, I'll cover three of my team's efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation: a multi-stage model which exploits user guidance in the form of mouse traces, and a single-stage one which uses cross-modal contrastive losses. Third: Room-Across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas. I'll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.
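To make the "cross-modal contrastive losses" mentioned above concrete, the sketch below shows a generic symmetric InfoNCE-style objective over paired image and text (or speech) embeddings. It is an illustrative example only, not the authors' implementation; the function name, temperature value, and use of PyTorch are assumptions for the sake of the example.

```python
# Minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss.
# Hypothetical helper, not taken from the work described in the abstract.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of matched pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matched pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss covering both retrieval directions:
    # image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

The same shape of objective applies whether the second modality is encoded from text or directly from speech, which is what makes it a natural fit for both the retrieval and generation settings described in the talk.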