Toddler-inspired embodied vision for learning object representations

2022 IEEE International Conference on Development and Learning (ICDL) · Pub Date: 2022-09-12 · DOI: 10.1109/ICDL53763.2022.9962190

A. Aubret, Céline Teulière, J. Triesch


Citations: 2

Abstract

Recent time-contrastive learning approaches can learn invariant object representations without supervision. They do so by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution available only in the central region of the field of view. Second, objects may be seen against a blurry background because of toddlers’ limited depth of field. Third, because toddlers have rather short arms, the objects they manipulate are mostly seen up close and fill a large part of the field of view. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent “play” with objects in different locations of a near-photorealistic flat. During each play session, the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by three factors: a reduction of the features extracted from the background, a neural network bias towards large features in the image, and a greater similarity between novel and familiar background regions. The results of our model suggest that several influences on toddlers’ visual input statistics support their unsupervised learning of object representations.
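The idea of mapping successive views onto close-by representations can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which loss is used, so the InfoNCE-style objective below, the function names `cosine` and `time_contrastive_loss`, and the `temperature` value are all assumptions chosen for illustration. Each frame's embedding is pulled towards the embedding of the next frame in the sequence and pushed away from all other frames.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_contrastive_loss(embeddings, temperature=0.1):
    """Mean InfoNCE-style loss over a temporally ordered list of embeddings.

    Assumption (not from the paper): the positive for frame t is frame t + 1,
    and every other frame in the sequence serves as a negative.
    """
    n = len(embeddings)
    if n < 3:
        raise ValueError("need at least three frames (one positive, one negative)")
    total = 0.0
    for t in range(n - 1):
        # Scaled similarities between frame t and every other frame.
        logits = {
            j: cosine(embeddings[t], embeddings[j]) / temperature
            for j in range(n)
            if j != t
        }
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        # Negative log-probability of picking the temporal successor.
        total += log_denom - logits[t + 1]
    return total / (n - 1)


# Hypothetical 2-D embeddings: two views of one object, then two of another.
smooth = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
# Same geometry, but successive frames are maximally dissimilar.
scrambled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(time_contrastive_loss(smooth) < time_contrastive_loss(scrambled))
```

In this toy sequence the loss is lower when temporally adjacent frames have similar embeddings. The one frame pair that straddles the switch between objects still counts as a positive, which mirrors the agent turning its body from one object to the next.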


