Generative Modeling of Audible Shapes for Object Perception

2017 IEEE International Conference on Computer Vision (ICCV) | Pub Date: 2017-10-01 | DOI: 10.1109/ICCV.2017.141

Zhoutong Zhang, Jiajun Wu, Qiujia Li, Zhengjia Huang, James Traer, Josh H. McDermott, J. Tenenbaum, W. Freeman

Citations: 35

Abstract

Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine of such competency, however, is very challenging, due to the great difficulty in capturing large-scale, clean data of objects with both their appearance and the sound they make. In this paper, we present a novel, open-source pipeline that generates audiovisual data, purely from 3D object shapes and their physical properties. Through comparison with audio recordings and human behavioral studies, we validate the accuracy of the sounds it generates. Using this generative model, we are able to construct a synthetic audio-visual dataset, namely Sound-20K, for object perception tasks. We demonstrate that auditory and visual information play complementary roles in object perception, and further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.



