SMAK-Net: Self-Supervised Multi-level Spatial Attention Network for Knowledge Representation towards Imitation Learning

Kartik Ramachandruni, M. Vankadari, A. Majumder, S. Dutta, Swagat Kumar
DOI: 10.1109/RO-MAN46459.2019.8956303
Published in: 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)
Publication date: 2019-10-01
Citations: 0

Abstract

In this paper, we propose an end-to-end self-supervised feature representation network for imitation learning. The proposed network incorporates a novel multi-level spatial attention module that amplifies relevant and suppresses irrelevant information while learning task-specific feature embeddings. The multi-level attention module takes multiple intermediate feature maps of the input image at different stages of the CNN pipeline and produces a 2D matrix of compatibility scores for each feature map with respect to the given task. The weighted combination of the feature vectors with the scores estimated by the attention modules leads to a more task-specific feature representation of the input images. We thus name the proposed network SMAK-Net, abbreviated from Self-supervised Multi-level spatial Attention Knowledge representation Network. We train this network using a metric learning loss that decreases the distance between the feature representations of simultaneous frames from multiple viewpoints and increases the distance between neighboring frames of the same viewpoint. The experiments are performed on the publicly available Multi-View Pouring dataset [1]. The outputs of the attention module are shown to highlight task-specific objects while suppressing the rest of the background in the input image. The proposed method is validated by qualitative and quantitative comparisons with the state-of-the-art technique TCN [1], along with extensive ablation studies. The method significantly outperforms TCN by 6.5% on the temporal alignment error metric while reducing the total number of training steps by 155K.
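To make the two mechanisms in the abstract concrete, the sketch below illustrates (a) compatibility-score attention over one intermediate feature map followed by a score-weighted combination of local feature vectors, and (b) a time-contrastive triplet loss that pulls together simultaneous frames from different viewpoints and pushes apart temporal neighbors from the same viewpoint. This is a minimal illustration, not the paper's implementation: the dot-product compatibility function, the softmax normalization, and the squared-Euclidean margin loss are assumptions chosen for clarity, and the function names (`attention_pool`, `triplet_loss`) are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(feature_map, global_desc):
    """Attention over one intermediate feature map (assumed mechanism).

    feature_map: (H*W, C) array of local feature vectors from one CNN stage.
    global_desc: (C,) task-level descriptor the local features are scored against.
    Returns the (H*W,) attention weights (reshapeable to the 2D score matrix
    mentioned in the abstract) and the attended (C,) feature vector.
    """
    scores = feature_map @ global_desc      # compatibility score per location
    weights = softmax(scores)               # normalize into an attention map
    attended = weights @ feature_map        # score-weighted combination
    return weights, attended

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Time-contrastive metric learning loss (assumed triplet form).

    anchor/positive: embeddings of simultaneous frames from two viewpoints,
    pulled together; negative: a temporally nearby frame from the anchor's
    own viewpoint, pushed away by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

In this reading, the multi-level aspect of SMAK-Net amounts to applying `attention_pool` at several CNN stages and concatenating the attended vectors before the embedding head; the loss then shapes that embedding so that viewpoint changes are ignored while task progress is preserved.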