1st International Workshop on Multimodal Understanding and Learning for Embodied Applications: Latest Publications

On the Multisensory Nature of Objects and Language: A Robotics Perspective
J. Sinapov
{"title":"On the Multisensory Nature of Objects and Language: A Robotics Perspective","authors":"J. Sinapov","doi":"10.1145/3347450.3357658","DOIUrl":"https://doi.org/10.1145/3347450.3357658","url":null,"abstract":"Infants use exploratory behaviors to learn about the objects around them. Psychologists have theorized that behaviors such as grasping touching, pressing, and lifting, coupled with the visual, tactile, haptic and auditory sensory modalities, enable infants to form grounded object representations. For example, scratching an object can provide information about its roughness, while lifting it can provide information about its weight. In a sense, the exploratory behavior acts as a \"question'' to the object, which is subsequently \"answered\" by the sensory stimuli produced during the execution of the behavior. In contrast, most object representations used by robots today rely solely on computer vision or laser scan data, gathered through passive observation. Such disembodied approaches to robotic perception may be useful for recognizing an object using a 3D model database, but nevertheless, will fail to infer object properties that cannot be detected using vision alone. To bridge this gap, our research has pursued a developmental framework for object perception and exploration in which the robot's representation of objects is grounded in its own sensorimotor experience with them citesinapov2014grounding. In this framework, an object is represented by sensorimotor contingencies that span a diverse set of exploratory behaviors and sensory modalities. In this talk, I will highlight results from several large-scale experimental studies which show that the behavior-grounded object representation enables a robot to solve a wide variety of perceptual and cognitive tasks relevant to object learning citesinapov2014learning,sinapov2011interactive. I will discuss recent work on how robots can ground language in multisensory experience with objects citethomason2016learning and will conclude with a discussion on open problems in multisensory symbol grounding, which, if solved, could result in the large-scale deployment of robotic systems in real-world domains.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130545215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Learning to Navigate
Piotr Wojciech Mirowski
{"title":"Learning to Navigate","authors":"Piotr Wojciech Mirowski","doi":"10.1145/3347450.3357659","DOIUrl":"https://doi.org/10.1145/3347450.3357659","url":null,"abstract":"Navigation is an important cognitive task that enables humans and animals to traverse, with or without maps, over long distances in the complex world. Such long-range navigation can simultaneously support self-localisation (\"I am here\") and a representation of the goal (\"I am going there\"). For this reason, studying navigation is fundamental to the study and development of artificial intelligence, and trying to replicate navigation in artificial agents can also help neuroscientists understand its biological underpinnings. This talk will cover our own journey to understand navigation by building deep reinforcement learning agents, starting from learning to control a simple agent that can explore and memorise large 3D mazes to designing agents with a read-write memory that can generalise to unseen mazes from one traversal. I will show how these artificial agents relate to navigation in the real world, both through the study of the emergence of grid cell representations in neural networks and by demonstrating that these agents can navigate in Street View-based real world photographic environments. I will finally present two approaches in our ongoing work on leveraging multimodal information for generalising navigation policies to unseen environments in Street View, one consisting in following language instructions and the second one in transferring navigation policies by training on aerial views.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122991897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Connecting Language and Vision: From Captioning towards Embodied Learning
Subhashini Venugopalan
{"title":"Connecting Language and Vision: From Captioning towards Embodied Learning","authors":"Subhashini Venugopalan","doi":"10.1145/3347450.3357660","DOIUrl":"https://doi.org/10.1145/3347450.3357660","url":null,"abstract":"For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe the content to varying degrees of detail. For computers however, interpreting content from image and video pixels and associating them with language is very challenging. Research in the recent past has made tremendous progress in this problem of visual language grounding, i.e. interpreting visual content, from images and videos, and associating them with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by developing versatile and elegant ways of combining them. However to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation tasks or manipulation tasks. Following the exciting progress in this space, we expect research in connecting language and vision to continue to accelerate in the coming years towards the development of embodied agents that learn to navigate the real world through human interaction.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"439 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Improvement on Audio-to-MIDI Alignment Using Triplet Pair
Yifan Wang, Shuchang Liu, Li Guo
{"title":"An Improvement on Audio-to-MIDI Alignment Using Triplet Pair","authors":"Yifan Wang, Shuchang Liu, Li Guo","doi":"10.1145/3347450.3357661","DOIUrl":"https://doi.org/10.1145/3347450.3357661","url":null,"abstract":"In this paper, we employ a neural network based cross-modality model on audio-to-MIDI alignment task. A novel loss function based on Hinge Loss is proposed to optimize the model learning an Euclidean embedding space, where the distance of embedding vectors can be directly used as a measure of similarity in alignment. In the previous alignment system also based on cross-modality model, there are positive and negative pairs in the loss function, which represent aligned and misaligned pairs. In this paper, we introduce an extra pair named overlapping to capture musical onset information. We evaluate our system on the MAPS dataset and compare it to other previous methods. The results reveal that the align accuracy of the proposed system beats the transcription based method by a significant margin, e.g., 81.61% to 86.41%, when the align error threshold is set to 10 ms. And the proposed loss also has an improvement on the statistics of absolute onset errors in comparison to the loss function implemented in other audio-to-MIDI alignment system. We also conduct experiments on the dimension of embedding vectors and results show the proposed system can still maintain the alignment performance with lower dimension.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126461489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
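The abstract does not spell out the exact formulation, but the idea of adding an "overlapping" pair term on top of the usual positive/negative hinge terms in an embedding space can be sketched as follows. The margins, weighting, and function names below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of a hinge-based embedding loss with positive, negative,
# and "overlapping" pairs, in the spirit of the abstract. Margin and weight
# values are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def alignment_hinge_loss(anchor, positive, negative, overlapping,
                         margin_neg=1.0, margin_over=0.5, w_over=0.5):
    """anchor: audio-frame embeddings; positive: aligned MIDI frames;
    negative: misaligned MIDI frames; overlapping: frames sharing an onset.
    All tensors have shape (batch, dim)."""
    d_pos = F.pairwise_distance(anchor, positive)      # should stay small
    d_neg = F.pairwise_distance(anchor, negative)      # should exceed d_pos by margin_neg
    d_over = F.pairwise_distance(anchor, overlapping)  # pushed away by a smaller margin

    loss_neg = F.relu(d_pos - d_neg + margin_neg)
    loss_over = F.relu(d_pos - d_over + margin_over)
    return (loss_neg + w_over * loss_over).mean()
```

With such a loss, the distance between an audio embedding and candidate MIDI embeddings can be used directly as the alignment cost, as the abstract describes.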
Video Object Linguistic Grounding
Alba Herrera-Palacio, Carles Ventura, Xavier Giró-i-Nieto
{"title":"Video Object Linguistic Grounding","authors":"Alba Herrera-Palacio, Carles Ventura, Xavier Giró-i-Nieto","doi":"10.1145/3347450.3357662","DOIUrl":"https://doi.org/10.1145/3347450.3357662","url":null,"abstract":"The goal of this work is segmenting on a video sequence the objects which are mentioned in a linguistic description of the scene. We have adapted an existing deep neural network that achieves state of the art performance in semi-supervised video object segmentation, to add a linguistic branch that would generate an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122862175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
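As a rough illustration of the described adaptation (not the authors' architecture), a linguistic branch could turn a sentence embedding into a spatial attention map over per-frame features; all dimensions and names below are assumed:

```python
# Illustrative sketch: a language embedding is projected into the visual feature
# space and used to weight spatial locations in every frame of the sequence.
import torch
import torch.nn as nn

class LinguisticAttention(nn.Module):
    def __init__(self, visual_dim=256, text_dim=300):
        super().__init__()
        self.project = nn.Linear(text_dim, visual_dim)  # map sentence to visual space

    def forward(self, frame_feats, sentence_emb):
        """frame_feats: (T, C, H, W) per-frame features; sentence_emb: (text_dim,)."""
        query = self.project(sentence_emb)                      # (C,)
        attn = torch.einsum('tchw,c->thw', frame_feats, query)  # similarity per location
        attn = torch.softmax(attn.flatten(1), dim=1).view_as(attn)
        return frame_feats * attn.unsqueeze(1)                  # language-attended features
```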
Clustering Optimization for Abnormality Detection in Semi-Autonomous Systems
Hafsa Iqbal, Damian Campo, Mohamad Baydoun, L. Marcenaro, David Martín Gómez, C. Regazzoni
{"title":"Clustering Optimization for Abnormality Detection in Semi-Autonomous Systems","authors":"Hafsa Iqbal, Damian Campo, Mohamad Baydoun, L. Marcenaro, David Martín Gómez, C. Regazzoni","doi":"10.1145/3347450.3357657","DOIUrl":"https://doi.org/10.1145/3347450.3357657","url":null,"abstract":"The use of machine learning techniques is fundamental for developing autonomous systems that can assist humans in everyday tasks. This paper focus on selecting an appropriate network size for detecting abnormalities in multisensory data coming from a semi-autonomous vehicle. We use an extension of Growing Neural Gas with the utility measurement (GNG-U) for segmenting multisensory data into an optimal set of clusters that facilitate a semantic interpretation of data and define local linear models used for prediction purposes. A functional that favors precise linear dynamical models in large state space regions is considered for optimization purposes. The proposed method is tested with synchronized multi-sensor dynamic data related to different maneuvering tasks performed by a semi-autonomous vehicle that interacts with pedestrians in a closed environment. Comparisons with a previous work of abnormality detection are provided.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129425529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
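For readers unfamiliar with GNG-U, the following minimal sketch illustrates the utility-based node-removal idea the paper builds on; the threshold k, the learning rate, and the omission of the topology (edge) updates are simplifications for illustration, not the paper's configuration:

```python
# Hedged sketch of one GNG-U update: each prototype accumulates an error and a
# utility, and the least useful prototype is removed when the max-error to
# min-utility ratio exceeds a threshold k.
import numpy as np

def gng_u_step(nodes, errors, utilities, x, k=3.0, eps=0.05):
    """nodes: (N, D) prototype vectors; errors, utilities: (N,) accumulators;
    x: (D,) input sample. Edge/topology maintenance is omitted."""
    d = np.linalg.norm(nodes - x, axis=1)
    s1, s2 = np.argsort(d)[:2]                  # winner and runner-up

    errors[s1] += d[s1] ** 2                    # quantization error of the winner
    utilities[s1] += d[s2] ** 2 - d[s1] ** 2    # error saved by keeping the winner
    nodes[s1] += eps * (x - nodes[s1])          # move winner toward the sample

    if errors.max() > k * max(utilities.min(), 1e-8):
        worst = np.argmin(utilities)            # drop the least useful prototype
        keep = np.arange(len(nodes)) != worst
        nodes, errors, utilities = nodes[keep], errors[keep], utilities[keep]
    return nodes, errors, utilities
```

The resulting prototypes define the local regions in which the paper fits its linear dynamical models.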
Visually Grounded Language Learning for Robot Navigation
E. Ünal, Ozan Arkan Can, Y. Yemez
{"title":"Visually Grounded Language Learning for Robot Navigation","authors":"E. Ünal, Ozan Arkan Can, Y. Yemez","doi":"10.1145/3347450.3357655","DOIUrl":"https://doi.org/10.1145/3347450.3357655","url":null,"abstract":"We present an end-to-end deep learning model for robot navigation from raw visual pixel input and natural text instructions. The proposed model is an LSTM-based sequence-to-sequence neural network architecture with attention, which is trained on instruction-perception data samples collected in a synthetic environment. We conduct experiments on the SAIL dataset which we reconstruct in 3D so as to generate the 2D images associated with the data. Our experiments show that the performance of our model is on a par with state-of-the-art, despite the fact that it learns navigational language with end-to-end training from raw visual data.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126730467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
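A minimal sketch of the kind of LSTM sequence-to-sequence decoder with attention the abstract describes is given below; the way visual features and the attended instruction context are fused, and all layer sizes, are assumptions for illustration only:

```python
# Hypothetical sketch: an LSTM encoder over instruction tokens, and an LSTMCell
# decoder that fuses the current visual features with an attention-weighted
# instruction context before predicting the next navigation action.
import torch
import torch.nn as nn

class InstructionFollower(nn.Module):
    def __init__(self, vocab=1000, emb=64, hid=128, vis=256, n_actions=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTMCell(vis + hid, hid)
        self.attn = nn.Linear(hid, hid)
        self.policy = nn.Linear(hid, n_actions)

    def forward(self, instruction, visual, state):
        """instruction: (B, L) token ids; visual: (B, vis) features of the
        current view; state: (h, c) decoder state, each of shape (B, hid)."""
        enc, _ = self.encoder(self.embed(instruction))       # (B, L, hid)
        h, c = state
        scores = torch.bmm(enc, self.attn(h).unsqueeze(2))   # (B, L, 1)
        alpha = torch.softmax(scores, dim=1)                 # attention over tokens
        context = (alpha * enc).sum(dim=1)                   # (B, hid)
        h, c = self.decoder(torch.cat([visual, context], dim=1), (h, c))
        return self.policy(h), (h, c)                        # action logits, new state
```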
Geometry-aware Relational Exemplar Attention for Dense Captioning
T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen
{"title":"Geometry-aware Relational Exemplar Attention for Dense Captioning","authors":"T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen","doi":"10.1145/3347450.3357656","DOIUrl":"https://doi.org/10.1145/3347450.3357656","url":null,"abstract":"Dense captioning (DC), which provides a comprehensive context understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual contents and to generate captions of wider diversity and increased details. The state-of-the-art models of DC consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as the context cues along with regional features and (b) refining locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions in complementary relations, i.e. contextually dependent, visually supported and geometry relations, to enrich context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves the state-of-the-art by at least +5.3% in terms of the mean average precision on the Visual Genome dataset.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122483451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
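The GREatt formulation itself is not reproduced here, but the general pattern of combining appearance-based attention with pairwise box-geometry features between region proposals can be sketched as follows; all names and sizes are hypothetical:

```python
# Illustrative sketch of geometry-aware relational attention over region
# proposals: pairwise relative offsets and scales of boxes are turned into
# attention biases that modulate appearance-based attention scores.
import torch
import torch.nn as nn

def box_geometry(boxes, eps=1e-6):
    """boxes: (N, 4) as (x, y, w, h). Returns (N, N, 4) pairwise geometry features."""
    x, y, w, h = boxes.unbind(dim=1)
    dx = torch.log((x[:, None] - x[None, :]).abs() / w[:, None] + eps)
    dy = torch.log((y[:, None] - y[None, :]).abs() / h[:, None] + eps)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometryAwareAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.geo = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, feats, boxes):
        """feats: (N, dim) region features; boxes: (N, 4) region boxes."""
        app = self.q(feats) @ self.k(feats).t() / feats.size(1) ** 0.5  # appearance scores
        geo = self.geo(box_geometry(boxes)).squeeze(-1)                 # geometry scores
        attn = torch.softmax(app + geo, dim=-1)
        return attn @ feats                                             # relation-enriched features
```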
Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification
Mengyi Liu, Zhu Liu
{"title":"Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification","authors":"Mengyi Liu, Zhu Liu","doi":"10.1145/3347450.3357654","DOIUrl":"https://doi.org/10.1145/3347450.3357654","url":null,"abstract":"Nowadays multimedia contents including text, images, and videos have been produced and shared ubiquitously in our daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. The trend of web data becoming increasingly multimodal makes the task of multimodal classification ever more popular and pertinent. In this paper, we mainly focus on the scenario of videos for their intrinsic multimodal property, and resort to attention learning among different modalities for classification. Specifically, we formulate the multimodal attention learning as a sequential decision-making process, and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward which considers the multi-label classification loss, and two unsupervised rewards which simultaneously consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art methods of attention mechanism and multimodal fusion for video classification task.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132439215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
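The following hypothetical sketch illustrates the modality-selection idea: at each time step a policy network samples which modality enters the aggregated representation, and its log-probabilities can later be reinforced with a reward such as the negative classification loss. All names, sizes, and the single-sequence (unbatched) setting are assumptions, not the authors' framework:

```python
# Illustrative sketch: per-time-step modality selection by a policy network,
# followed by classification on the aggregated features.
import torch
import torch.nn as nn

class ModalitySelector(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, hid=256, n_classes=20):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(vis_dim + txt_dim, hid),
                                    nn.ReLU(), nn.Linear(hid, 2))
        self.vis_proj = nn.Linear(vis_dim, hid)
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.classifier = nn.Linear(hid, n_classes)

    def forward(self, vis_seq, txt_seq):
        """vis_seq: (T, vis_dim), txt_seq: (T, txt_dim) per-time-step features."""
        agg, log_probs = 0.0, []
        for v, t in zip(vis_seq, txt_seq):
            dist = torch.distributions.Categorical(
                logits=self.policy(torch.cat([v, t])))
            a = dist.sample()                   # 0 = use visual, 1 = use text
            log_probs.append(dist.log_prob(a))
            agg = agg + (self.vis_proj(v) if a == 0 else self.txt_proj(t))
        logits = self.classifier(agg / len(vis_seq))
        return logits, torch.stack(log_probs)   # logits for the loss; log-probs for REINFORCE
```

In training, a REINFORCE-style gradient would weight the summed log-probabilities by a reward derived from the multi-label classification loss (and, per the abstract, by unsupervised consistency and reconstruction rewards).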
MultiLock: Mobile Active Authentication based on Multiple Biometric and Behavioral Patterns
A. Acien, A. Morales, R. Vera-Rodríguez, Julian Fierrez
{"title":"MultiLock: Mobile Active Authentication based on Multiple Biometric and Behavioral Patterns","authors":"A. Acien, A. Morales, R. Vera-Rodríguez, Julian Fierrez","doi":"10.1145/3347450.3357663","DOIUrl":"https://doi.org/10.1145/3347450.3357663","url":null,"abstract":"In this paper we evaluate how discriminative are behavior-based signals obtained from the smartphone sensors. The main aim is to evaluate these signals for person recognition. The recognition based on these signals increases the security of devices, but also implies privacy concerns. We consider seven different data channels and their combinations. Touch dynamics (touch gestures and keystroking), accelerometer, gyroscope, WiFi, GPS location and app usage are all collected during human-mobile interaction to authenticate the users. We evaluate two approaches: one-time authentication and active authentication. In one-time authentication, we employ the information of all channels available during one session. For active authentication we take advantage of mobile user behavior across multiple sessions by updating a confidence value of the authentication score. Our experiments are conducted on the semi-uncontrolled UMDAA-02 database. This database comprises of smartphone sensor signals acquired during natural human-mobile interaction. Our results show that different traits can be complementary and multimodal systems clearly increase the performance with accuracies ranging from 82.2% to 97.1% depending on the authentication scenario. These results confirm the discriminative power of these signals.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133117124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
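As an illustration of the two evaluation scenarios, the sketch below fuses per-channel scores for one-time authentication and smooths a confidence value over sessions for active authentication; the fusion weights, smoothing factor, and threshold are assumed values, not those reported in the paper:

```python
# Minimal sketch of score-level fusion (one-time authentication) and a
# confidence update across sessions (active authentication).
import numpy as np

def one_time_score(channel_scores, weights=None):
    """channel_scores: dict of channel -> match score in [0, 1] for one session."""
    s = np.array(list(channel_scores.values()), dtype=float)
    w = np.ones_like(s) / len(s) if weights is None else np.asarray(weights)
    return float(np.dot(w, s))                  # weighted score-level fusion

def active_authentication(session_scores, alpha=0.3, threshold=0.5):
    """session_scores: iterable of fused scores over consecutive sessions."""
    confidence = 1.0                            # start by trusting the enrolled user
    for s in session_scores:
        confidence = (1 - alpha) * confidence + alpha * s   # smooth update over time
        if confidence < threshold:
            return False, confidence            # lock the device
    return True, confidence

# Example: active_authentication([0.9, 0.8, 0.2, 0.1]) eventually locks the device.
```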