Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

arXiv - CS - Sound. Pub Date: 2024-05-05. DOI: arxiv-2405.02821

Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

Abstract

Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks end-to-end in simulation. While there has been substantial progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. Sound differs from light in that it spans a much wider range of frequencies and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between simulation and the real world by training AFP models that take only a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves performance on the real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.
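The frequency-adaptive selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the measured spectral difference and the received energy are combined, so the scoring rule below (normalized band energy minus a weighted gap), the function names `band_energies` and `select_band`, and the `gap_weight` parameter are all assumptions made for illustration.

```python
def band_energies(power, freqs, bands):
    """Sum spectral power inside each (low_hz, high_hz) band."""
    return [
        sum(p for p, f in zip(power, freqs) if lo <= f < hi)
        for lo, hi in bands
    ]


def select_band(power, freqs, bands, sim2real_gap, gap_weight=1.0):
    """Return (index of the chosen band, per-band scores).

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score = normalized band energy - gap_weight * measured sim2real gap
    A band is preferred when the received audio has energy there AND the
    simulator was measured to match the real world well in that band.
    """
    energies = band_energies(power, freqs, bands)
    total = sum(energies) or 1.0  # avoid division by zero on silent input
    scores = [
        e / total - gap_weight * g
        for e, g in zip(energies, sim2real_gap)
    ]
    best = max(range(len(bands)), key=scores.__getitem__)
    return best, scores


# Illustrative inputs only: three subbands, a made-up per-band gap, and a
# made-up power spectrum sampled every 250 Hz from 0 to 7750 Hz.
bands = [(0, 1000), (1000, 4000), (4000, 8000)]
gap = [0.6, 0.1, 0.3]
freqs = [250.0 * i for i in range(32)]
power = [1.0 if f < 1000 else 0.5 if f < 4000 else 0.05 for f in freqs]

best, scores = select_band(power, freqs, bands, gap)
print(bands[best], [round(s, 3) for s in scores])
```

In this toy input the lowest band is loud but has a large gap, so the sketch picks the middle band. In the paper the gap is measured by training AFP models on individual subbands; here it is simply a list of numbers supplied by the caller.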
