{"title":"Connecting Language and Vision: From Captioning towards Embodied Learning","authors":"Subhashini Venugopalan","doi":"10.1145/3347450.3357660","DOIUrl":null,"url":null,"abstract":"For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe the content to varying degrees of detail. For computers however, interpreting content from image and video pixels and associating them with language is very challenging. Research in the recent past has made tremendous progress in this problem of visual language grounding, i.e. interpreting visual content, from images and videos, and associating them with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by developing versatile and elegant ways of combining them. However to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation tasks or manipulation tasks. Following the exciting progress in this space, we expect research in connecting language and vision to continue to accelerate in the coming years towards the development of embodied agents that learn to navigate the real world through human interaction.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"439 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3347450.3357660","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe its content at varying levels of detail. For computers, however, interpreting content from image and video pixels and associating it with language is very challenging. Recent research has made tremendous progress on this problem of visual language grounding, i.e., interpreting visual content from images and videos and associating it with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by the development of versatile and elegant ways of combining them. However, to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect, the embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation or manipulation tasks. Following the exciting progress in this space, we expect research connecting language and vision to continue to accelerate in the coming years toward the development of embodied agents that learn to navigate the real world through human interaction.
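To make the "ways of combining" visual recognition and language generation mentioned above concrete, the sketch below shows one common encoder-decoder pattern for video captioning: pretrained-CNN frame features are pooled into a video representation that initializes an LSTM language decoder. This is a minimal illustration in PyTorch, not the author's actual model; all class names, dimensions, and the mean-pooling design are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Illustrative encoder-decoder: mean-pooled frame features condition an LSTM decoder."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)      # project visual features
        self.embed = nn.Embedding(vocab_size, embed_dim)   # caption token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # per-step vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a pretrained CNN (assumed given)
        # captions:    (batch, seq_len) token ids, teacher-forced during training
        video = self.encode(frame_feats.mean(dim=1))       # (batch, hidden_dim)
        h0 = video.unsqueeze(0)                            # seed decoder state with the video
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                         # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))               # (batch, seq_len, hidden_dim)
        return self.out(hidden)                            # (batch, seq_len, vocab_size)

# Toy usage: 4 videos, 20 frames each, 2048-d features, 12-token captions.
model = VideoCaptioner()
feats = torch.randn(4, 20, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits = model(feats, caps)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps.reshape(-1))
```

Mean pooling discards temporal order; published captioning models typically replace it with a recurrent or attention-based encoder over the frame sequence, but the overall encode-then-decode structure is the same.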