{"title":"Connecting Language and Vision: From Captioning towards Embodied Learning","authors":"Subhashini Venugopalan","doi":"10.1145/3347450.3357660","DOIUrl":null,"url":null,"abstract":"For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe the content to varying degrees of detail. For computers however, interpreting content from image and video pixels and associating them with language is very challenging. Research in the recent past has made tremendous progress in this problem of visual language grounding, i.e. interpreting visual content, from images and videos, and associating them with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by developing versatile and elegant ways of combining them. However to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation tasks or manipulation tasks. Following the exciting progress in this space, we expect research in connecting language and vision to continue to accelerate in the coming years towards the development of embodied agents that learn to navigate the real world through human interaction.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"439 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3347450.3357660","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe its content at varying levels of detail. For computers, however, interpreting content from image and video pixels and associating it with language is very challenging. Recent research has made tremendous progress on this problem of visual language grounding, i.e., interpreting visual content from images and videos and associating it with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by the development of versatile and elegant ways of combining them. However, to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect, the embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation or manipulation tasks. Following the exciting progress in this space, we expect research connecting language and vision to continue to accelerate in the coming years toward the development of embodied agents that learn to navigate the real world through human interaction.
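To make the "ways of combining" visual recognition and language generation mentioned above concrete, the sketch below shows one common encoder-decoder pattern for video captioning: pretrained-CNN frame features are pooled into a video representation that initializes an LSTM language decoder. This is a minimal illustration in PyTorch, not the author's actual model; all class names, dimensions, and the mean-pooling design are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Illustrative encoder-decoder: mean-pooled frame features condition an LSTM decoder."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)      # project visual features
        self.embed = nn.Embedding(vocab_size, embed_dim)   # caption token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)       # per-step vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a pretrained CNN (assumed given)
        # captions:    (batch, seq_len) token ids, teacher-forced during training
        video = self.encode(frame_feats.mean(dim=1))       # (batch, hidden_dim)
        h0 = video.unsqueeze(0)                            # seed decoder state with the video
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                         # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))               # (batch, seq_len, hidden_dim)
        return self.out(hidden)                            # (batch, seq_len, vocab_size)

# Toy usage: 4 videos, 20 frames each, 2048-d features, 12-token captions.
model = VideoCaptioner()
feats = torch.randn(4, 20, 2048)
caps = torch.randint(0, 10000, (4, 12))
logits = model(feats, caps)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps.reshape(-1))
```

Mean pooling discards temporal order; published captioning models typically replace it with a recurrent or attention-based encoder over the frame sequence, but the overall encode-then-decode structure is the same.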