1st International Workshop on Multimodal Understanding and Learning for Embodied Applications: Latest Publications

On the Multisensory Nature of Objects and Language: A Robotics Perspective
J. Sinapov
{"title":"On the Multisensory Nature of Objects and Language: A Robotics Perspective","authors":"J. Sinapov","doi":"10.1145/3347450.3357658","DOIUrl":"https://doi.org/10.1145/3347450.3357658","url":null,"abstract":"Infants use exploratory behaviors to learn about the objects around them. Psychologists have theorized that behaviors such as grasping touching, pressing, and lifting, coupled with the visual, tactile, haptic and auditory sensory modalities, enable infants to form grounded object representations. For example, scratching an object can provide information about its roughness, while lifting it can provide information about its weight. In a sense, the exploratory behavior acts as a \"question'' to the object, which is subsequently \"answered\" by the sensory stimuli produced during the execution of the behavior. In contrast, most object representations used by robots today rely solely on computer vision or laser scan data, gathered through passive observation. Such disembodied approaches to robotic perception may be useful for recognizing an object using a 3D model database, but nevertheless, will fail to infer object properties that cannot be detected using vision alone. To bridge this gap, our research has pursued a developmental framework for object perception and exploration in which the robot's representation of objects is grounded in its own sensorimotor experience with them citesinapov2014grounding. In this framework, an object is represented by sensorimotor contingencies that span a diverse set of exploratory behaviors and sensory modalities. In this talk, I will highlight results from several large-scale experimental studies which show that the behavior-grounded object representation enables a robot to solve a wide variety of perceptual and cognitive tasks relevant to object learning citesinapov2014learning,sinapov2011interactive. I will discuss recent work on how robots can ground language in multisensory experience with objects citethomason2016learning and will conclude with a discussion on open problems in multisensory symbol grounding, which, if solved, could result in the large-scale deployment of robotic systems in real-world domains.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130545215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Learning to Navigate
Piotr Wojciech Mirowski
{"title":"Learning to Navigate","authors":"Piotr Wojciech Mirowski","doi":"10.1145/3347450.3357659","DOIUrl":"https://doi.org/10.1145/3347450.3357659","url":null,"abstract":"Navigation is an important cognitive task that enables humans and animals to traverse, with or without maps, over long distances in the complex world. Such long-range navigation can simultaneously support self-localisation (\"I am here\") and a representation of the goal (\"I am going there\"). For this reason, studying navigation is fundamental to the study and development of artificial intelligence, and trying to replicate navigation in artificial agents can also help neuroscientists understand its biological underpinnings. This talk will cover our own journey to understand navigation by building deep reinforcement learning agents, starting from learning to control a simple agent that can explore and memorise large 3D mazes to designing agents with a read-write memory that can generalise to unseen mazes from one traversal. I will show how these artificial agents relate to navigation in the real world, both through the study of the emergence of grid cell representations in neural networks and by demonstrating that these agents can navigate in Street View-based real world photographic environments. I will finally present two approaches in our ongoing work on leveraging multimodal information for generalising navigation policies to unseen environments in Street View, one consisting in following language instructions and the second one in transferring navigation policies by training on aerial views.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122991897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Connecting Language and Vision: From Captioning towards Embodied Learning
Subhashini Venugopalan
{"title":"Connecting Language and Vision: From Captioning towards Embodied Learning","authors":"Subhashini Venugopalan","doi":"10.1145/3347450.3357660","DOIUrl":"https://doi.org/10.1145/3347450.3357660","url":null,"abstract":"For most humans, understanding multimedia content is easy, and in many cases images and videos are a preferred means of augmenting and enhancing human interaction and communication. Given a video, humans can discern a great deal from this rich information source and can interpret and describe the content to varying degrees of detail. For computers however, interpreting content from image and video pixels and associating them with language is very challenging. Research in the recent past has made tremendous progress in this problem of visual language grounding, i.e. interpreting visual content, from images and videos, and associating them with language. This progress has been made possible not only by advances in object recognition, activity recognition, and language generation, but also by developing versatile and elegant ways of combining them. However to realize the long-term goal of enabling fluent interaction between humans and computers/robots, it is also essential to ground language in action in addition to vision. In this respect embodied, task-oriented aspect of language grounding has emerged as a research direction that is garnering much attention. Current research focuses on developing new datasets and techniques for linking language to action in the real world, such as agents that follow instructions for navigation tasks or manipulation tasks. Following the exciting progress in this space, we expect research in connecting language and vision to continue to accelerate in the coming years towards the development of embodied agents that learn to navigate the real world through human interaction.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"439 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Improvement on Audio-to-MIDI Alignment Using Triplet Pair
Yifan Wang, Shuchang Liu, Li Guo
{"title":"An Improvement on Audio-to-MIDI Alignment Using Triplet Pair","authors":"Yifan Wang, Shuchang Liu, Li Guo","doi":"10.1145/3347450.3357661","DOIUrl":"https://doi.org/10.1145/3347450.3357661","url":null,"abstract":"In this paper, we employ a neural network based cross-modality model on audio-to-MIDI alignment task. A novel loss function based on Hinge Loss is proposed to optimize the model learning an Euclidean embedding space, where the distance of embedding vectors can be directly used as a measure of similarity in alignment. In the previous alignment system also based on cross-modality model, there are positive and negative pairs in the loss function, which represent aligned and misaligned pairs. In this paper, we introduce an extra pair named overlapping to capture musical onset information. We evaluate our system on the MAPS dataset and compare it to other previous methods. The results reveal that the align accuracy of the proposed system beats the transcription based method by a significant margin, e.g., 81.61% to 86.41%, when the align error threshold is set to 10 ms. And the proposed loss also has an improvement on the statistics of absolute onset errors in comparison to the loss function implemented in other audio-to-MIDI alignment system. We also conduct experiments on the dimension of embedding vectors and results show the proposed system can still maintain the alignment performance with lower dimension.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126461489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
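The abstract does not spell out the exact formulation, but the idea of adding an "overlapping" pair term on top of the usual positive/negative hinge terms in an embedding space can be sketched as follows. The margins, weighting, and function names below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of a hinge-based embedding loss with positive, negative,
# and "overlapping" pairs, in the spirit of the abstract. Margin and weight
# values are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def alignment_hinge_loss(anchor, positive, negative, overlapping,
                         margin_neg=1.0, margin_over=0.5, w_over=0.5):
    """anchor: audio-frame embeddings; positive: aligned MIDI frames;
    negative: misaligned MIDI frames; overlapping: frames sharing an onset.
    All tensors have shape (batch, dim)."""
    d_pos = F.pairwise_distance(anchor, positive)      # should stay small
    d_neg = F.pairwise_distance(anchor, negative)      # should exceed d_pos by margin_neg
    d_over = F.pairwise_distance(anchor, overlapping)  # pushed away by a smaller margin

    loss_neg = F.relu(d_pos - d_neg + margin_neg)
    loss_over = F.relu(d_pos - d_over + margin_over)
    return (loss_neg + w_over * loss_over).mean()
```

With such a loss, the distance between an audio embedding and candidate MIDI embeddings can be used directly as the alignment cost, as the abstract describes.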
Video Object Linguistic Grounding
Alba Herrera-Palacio, Carles Ventura, Xavier Giró-i-Nieto
{"title":"Video Object Linguistic Grounding","authors":"Alba Herrera-Palacio, Carles Ventura, Xavier Giró-i-Nieto","doi":"10.1145/3347450.3357662","DOIUrl":"https://doi.org/10.1145/3347450.3357662","url":null,"abstract":"The goal of this work is segmenting on a video sequence the objects which are mentioned in a linguistic description of the scene. We have adapted an existing deep neural network that achieves state of the art performance in semi-supervised video object segmentation, to add a linguistic branch that would generate an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122862175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
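As a rough illustration of the described adaptation (not the authors' architecture), a linguistic branch could turn a sentence embedding into a spatial attention map over per-frame features; all dimensions and names below are assumed:

```python
# Illustrative sketch: a language embedding is projected into the visual feature
# space and used to weight spatial locations in every frame of the sequence.
import torch
import torch.nn as nn

class LinguisticAttention(nn.Module):
    def __init__(self, visual_dim=256, text_dim=300):
        super().__init__()
        self.project = nn.Linear(text_dim, visual_dim)  # map sentence to visual space

    def forward(self, frame_feats, sentence_emb):
        """frame_feats: (T, C, H, W) per-frame features; sentence_emb: (text_dim,)."""
        query = self.project(sentence_emb)                      # (C,)
        attn = torch.einsum('tchw,c->thw', frame_feats, query)  # similarity per location
        attn = torch.softmax(attn.flatten(1), dim=1).view_as(attn)
        return frame_feats * attn.unsqueeze(1)                  # language-attended features
```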
Clustering Optimization for Abnormality Detection in Semi-Autonomous Systems
Hafsa Iqbal, Damian Campo, Mohamad Baydoun, L. Marcenaro, David Martín Gómez, C. Regazzoni
{"title":"Clustering Optimization for Abnormality Detection in Semi-Autonomous Systems","authors":"Hafsa Iqbal, Damian Campo, Mohamad Baydoun, L. Marcenaro, David Martín Gómez, C. Regazzoni","doi":"10.1145/3347450.3357657","DOIUrl":"https://doi.org/10.1145/3347450.3357657","url":null,"abstract":"The use of machine learning techniques is fundamental for developing autonomous systems that can assist humans in everyday tasks. This paper focus on selecting an appropriate network size for detecting abnormalities in multisensory data coming from a semi-autonomous vehicle. We use an extension of Growing Neural Gas with the utility measurement (GNG-U) for segmenting multisensory data into an optimal set of clusters that facilitate a semantic interpretation of data and define local linear models used for prediction purposes. A functional that favors precise linear dynamical models in large state space regions is considered for optimization purposes. The proposed method is tested with synchronized multi-sensor dynamic data related to different maneuvering tasks performed by a semi-autonomous vehicle that interacts with pedestrians in a closed environment. Comparisons with a previous work of abnormality detection are provided.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129425529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
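For readers unfamiliar with GNG-U, the following minimal sketch illustrates the utility-based node-removal idea the paper builds on; the threshold k, the learning rate, and the omission of the topology (edge) updates are simplifications for illustration, not the paper's configuration:

```python
# Hedged sketch of one GNG-U update: each prototype accumulates an error and a
# utility, and the least useful prototype is removed when the max-error to
# min-utility ratio exceeds a threshold k.
import numpy as np

def gng_u_step(nodes, errors, utilities, x, k=3.0, eps=0.05):
    """nodes: (N, D) prototype vectors; errors, utilities: (N,) accumulators;
    x: (D,) input sample. Edge/topology maintenance is omitted."""
    d = np.linalg.norm(nodes - x, axis=1)
    s1, s2 = np.argsort(d)[:2]                  # winner and runner-up

    errors[s1] += d[s1] ** 2                    # quantization error of the winner
    utilities[s1] += d[s2] ** 2 - d[s1] ** 2    # error saved by keeping the winner
    nodes[s1] += eps * (x - nodes[s1])          # move winner toward the sample

    if errors.max() > k * max(utilities.min(), 1e-8):
        worst = np.argmin(utilities)            # drop the least useful prototype
        keep = np.arange(len(nodes)) != worst
        nodes, errors, utilities = nodes[keep], errors[keep], utilities[keep]
    return nodes, errors, utilities
```

The resulting prototypes define the local regions in which the paper fits its linear dynamical models.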
Visually Grounded Language Learning for Robot Navigation
E. Ünal, Ozan Arkan Can, Y. Yemez
{"title":"Visually Grounded Language Learning for Robot Navigation","authors":"E. Ünal, Ozan Arkan Can, Y. Yemez","doi":"10.1145/3347450.3357655","DOIUrl":"https://doi.org/10.1145/3347450.3357655","url":null,"abstract":"We present an end-to-end deep learning model for robot navigation from raw visual pixel input and natural text instructions. The proposed model is an LSTM-based sequence-to-sequence neural network architecture with attention, which is trained on instruction-perception data samples collected in a synthetic environment. We conduct experiments on the SAIL dataset which we reconstruct in 3D so as to generate the 2D images associated with the data. Our experiments show that the performance of our model is on a par with state-of-the-art, despite the fact that it learns navigational language with end-to-end training from raw visual data.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126730467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
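A minimal sketch of the kind of LSTM sequence-to-sequence decoder with attention the abstract describes is given below; the way visual features and the attended instruction context are fused, and all layer sizes, are assumptions for illustration only:

```python
# Hypothetical sketch: an LSTM encoder over instruction tokens, and an LSTMCell
# decoder that fuses the current visual features with an attention-weighted
# instruction context before predicting the next navigation action.
import torch
import torch.nn as nn

class InstructionFollower(nn.Module):
    def __init__(self, vocab=1000, emb=64, hid=128, vis=256, n_actions=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTMCell(vis + hid, hid)
        self.attn = nn.Linear(hid, hid)
        self.policy = nn.Linear(hid, n_actions)

    def forward(self, instruction, visual, state):
        """instruction: (B, L) token ids; visual: (B, vis) features of the
        current view; state: (h, c) decoder state, each of shape (B, hid)."""
        enc, _ = self.encoder(self.embed(instruction))       # (B, L, hid)
        h, c = state
        scores = torch.bmm(enc, self.attn(h).unsqueeze(2))   # (B, L, 1)
        alpha = torch.softmax(scores, dim=1)                 # attention over tokens
        context = (alpha * enc).sum(dim=1)                   # (B, hid)
        h, c = self.decoder(torch.cat([visual, context], dim=1), (h, c))
        return self.policy(h), (h, c)                        # action logits, new state
```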
Geometry-aware Relational Exemplar Attention for Dense Captioning
T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen
{"title":"Geometry-aware Relational Exemplar Attention for Dense Captioning","authors":"T. Wang, H. R. Tavakoli, Mats Sjöberg, Jorma T. Laaksonen","doi":"10.1145/3347450.3357656","DOIUrl":"https://doi.org/10.1145/3347450.3357656","url":null,"abstract":"Dense captioning (DC), which provides a comprehensive context understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual contents and to generate captions of wider diversity and increased details. The state-of-the-art models of DC consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as the context cues along with regional features and (b) refining locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions in complementary relations, i.e. contextually dependent, visually supported and geometry relations, to enrich context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves the state-of-the-art by at least +5.3% in terms of the mean average precision on the Visual Genome dataset.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122483451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
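The GREatt formulation itself is not reproduced here, but the general pattern of combining appearance-based attention with pairwise box-geometry features between region proposals can be sketched as follows; all names and sizes are hypothetical:

```python
# Illustrative sketch of geometry-aware relational attention over region
# proposals: pairwise relative offsets and scales of boxes are turned into
# attention biases that modulate appearance-based attention scores.
import torch
import torch.nn as nn

def box_geometry(boxes, eps=1e-6):
    """boxes: (N, 4) as (x, y, w, h). Returns (N, N, 4) pairwise geometry features."""
    x, y, w, h = boxes.unbind(dim=1)
    dx = torch.log((x[:, None] - x[None, :]).abs() / w[:, None] + eps)
    dy = torch.log((y[:, None] - y[None, :]).abs() / h[:, None] + eps)
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)

class GeometryAwareAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.geo = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
        self.q, self.k = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, feats, boxes):
        """feats: (N, dim) region features; boxes: (N, 4) region boxes."""
        app = self.q(feats) @ self.k(feats).t() / feats.size(1) ** 0.5  # appearance scores
        geo = self.geo(box_geometry(boxes)).squeeze(-1)                 # geometry scores
        attn = torch.softmax(app + geo, dim=-1)
        return attn @ feats                                             # relation-enriched features
```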
Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification
Mengyi Liu, Zhu Liu
{"title":"Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification","authors":"Mengyi Liu, Zhu Liu","doi":"10.1145/3347450.3357654","DOIUrl":"https://doi.org/10.1145/3347450.3357654","url":null,"abstract":"Nowadays multimedia contents including text, images, and videos have been produced and shared ubiquitously in our daily life, which has encouraged researchers to develop algorithms for multimedia search and analysis in various applications. The trend of web data becoming increasingly multimodal makes the task of multimodal classification ever more popular and pertinent. In this paper, we mainly focus on the scenario of videos for their intrinsic multimodal property, and resort to attention learning among different modalities for classification. Specifically, we formulate the multimodal attention learning as a sequential decision-making process, and propose an end-to-end, deep reinforcement learning based framework to determine the selection of modality at each time step for the final feature aggregation model. To train our policy networks, we design a supervised reward which considers the multi-label classification loss, and two unsupervised rewards which simultaneously consider inter-modality correlation for consistency and intra-modality reconstruction for representativeness. Extensive experiments have been conducted on two large-scale multimodal video datasets to evaluate the whole framework and several key components, including the parameters of policy network, the effects of different rewards, and the rationality of the learned visual-text attention. Promising results demonstrate that our approach outperforms other state-of-the-art methods of attention mechanism and multimodal fusion for video classification task.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132439215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
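The following hypothetical sketch illustrates the modality-selection idea: at each time step a policy network samples which modality enters the aggregated representation, and its log-probabilities can later be reinforced with a reward such as the negative classification loss. All names, sizes, and the single-sequence (unbatched) setting are assumptions, not the authors' framework:

```python
# Illustrative sketch: per-time-step modality selection by a policy network,
# followed by classification on the aggregated features.
import torch
import torch.nn as nn

class ModalitySelector(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, hid=256, n_classes=20):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(vis_dim + txt_dim, hid),
                                    nn.ReLU(), nn.Linear(hid, 2))
        self.vis_proj = nn.Linear(vis_dim, hid)
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.classifier = nn.Linear(hid, n_classes)

    def forward(self, vis_seq, txt_seq):
        """vis_seq: (T, vis_dim), txt_seq: (T, txt_dim) per-time-step features."""
        agg, log_probs = 0.0, []
        for v, t in zip(vis_seq, txt_seq):
            dist = torch.distributions.Categorical(
                logits=self.policy(torch.cat([v, t])))
            a = dist.sample()                   # 0 = use visual, 1 = use text
            log_probs.append(dist.log_prob(a))
            agg = agg + (self.vis_proj(v) if a == 0 else self.txt_proj(t))
        logits = self.classifier(agg / len(vis_seq))
        return logits, torch.stack(log_probs)   # logits for the loss; log-probs for REINFORCE
```

In training, a REINFORCE-style gradient would weight the summed log-probabilities by a reward derived from the multi-label classification loss (and, per the abstract, by unsupervised consistency and reconstruction rewards).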
MultiLock: Mobile Active Authentication based on Multiple Biometric and Behavioral Patterns
A. Acien, A. Morales, R. Vera-Rodríguez, Julian Fierrez
{"title":"MultiLock: Mobile Active Authentication based on Multiple Biometric and Behavioral Patterns","authors":"A. Acien, A. Morales, R. Vera-Rodríguez, Julian Fierrez","doi":"10.1145/3347450.3357663","DOIUrl":"https://doi.org/10.1145/3347450.3357663","url":null,"abstract":"In this paper we evaluate how discriminative are behavior-based signals obtained from the smartphone sensors. The main aim is to evaluate these signals for person recognition. The recognition based on these signals increases the security of devices, but also implies privacy concerns. We consider seven different data channels and their combinations. Touch dynamics (touch gestures and keystroking), accelerometer, gyroscope, WiFi, GPS location and app usage are all collected during human-mobile interaction to authenticate the users. We evaluate two approaches: one-time authentication and active authentication. In one-time authentication, we employ the information of all channels available during one session. For active authentication we take advantage of mobile user behavior across multiple sessions by updating a confidence value of the authentication score. Our experiments are conducted on the semi-uncontrolled UMDAA-02 database. This database comprises of smartphone sensor signals acquired during natural human-mobile interaction. Our results show that different traits can be complementary and multimodal systems clearly increase the performance with accuracies ranging from 82.2% to 97.1% depending on the authentication scenario. These results confirm the discriminative power of these signals.","PeriodicalId":329495,"journal":{"name":"1st International Workshop on Multimodal Understanding and Learning for Embodied Applications","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133117124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
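As an illustration of the two evaluation scenarios, the sketch below fuses per-channel scores for one-time authentication and smooths a confidence value over sessions for active authentication; the fusion weights, smoothing factor, and threshold are assumed values, not those reported in the paper:

```python
# Minimal sketch of score-level fusion (one-time authentication) and a
# confidence update across sessions (active authentication).
import numpy as np

def one_time_score(channel_scores, weights=None):
    """channel_scores: dict of channel -> match score in [0, 1] for one session."""
    s = np.array(list(channel_scores.values()), dtype=float)
    w = np.ones_like(s) / len(s) if weights is None else np.asarray(weights)
    return float(np.dot(w, s))                  # weighted score-level fusion

def active_authentication(session_scores, alpha=0.3, threshold=0.5):
    """session_scores: iterable of fused scores over consecutive sessions."""
    confidence = 1.0                            # start by trusting the enrolled user
    for s in session_scores:
        confidence = (1 - alpha) * confidence + alpha * s   # smooth update over time
        if confidence < threshold:
            return False, confidence            # lock the device
    return True, confidence

# Example: active_authentication([0.9, 0.8, 0.2, 0.1]) eventually locks the device.
```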