来自音频的互补线索有助于在弱监督对象检测中对抗噪声

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2023-01-01 DOI:10.1109/WACV56688.2023.00222

Cagri Gungor, Adriana Kovashka

{"title":"来自音频的互补线索有助于在弱监督对象检测中对抗噪声","authors":"Cagri Gungor, Adriana Kovashka","doi":"10.1109/WACV56688.2023.00222","DOIUrl":null,"url":null,"abstract":"We tackle the problem of learning object detectors in a noisy environment, which is one of the significant challenges for weakly-supervised learning. We use multimodal learning to help localize objects of interest, but unlike other methods, we treat audio as an auxiliary modality that assists to tackle noise in detection from visual regions. First, we use the audio-visual model to generate new \"ground-truth\" labels for the training set to remove noise between the visual features and noisy supervision. Second, we propose an \"indirect path\" between audio and class predictions, which combines the link between visual and audio regions, and the link between visual features and predictions. Third, we propose a sound-based \"attention path\" which uses the benefit of complementary audio cues to identify important visual regions. We use contrastive learning to perform region-based audio-visual instance discrimination, which serves as an intermediate task and benefits from the complementary cues from audio to boost object classification and detection performance. We show that our methods, which update noisy ground truth and provide indirect and attention paths, greatly boosting performance on the AudioSet and VGGSound datasets compared to single-modality predictions, even ones that use contrastive learning. Our method outperforms previous weakly-supervised detectors for the task of object detection by reaching the state-of-art on AudioSet, and our sound localization module performs better than several state-of-art methods on AudioSet and MUSIC.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"51 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection\",\"authors\":\"Cagri Gungor, Adriana Kovashka\",\"doi\":\"10.1109/WACV56688.2023.00222\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We tackle the problem of learning object detectors in a noisy environment, which is one of the significant challenges for weakly-supervised learning. We use multimodal learning to help localize objects of interest, but unlike other methods, we treat audio as an auxiliary modality that assists to tackle noise in detection from visual regions. First, we use the audio-visual model to generate new \\\"ground-truth\\\" labels for the training set to remove noise between the visual features and noisy supervision. Second, we propose an \\\"indirect path\\\" between audio and class predictions, which combines the link between visual and audio regions, and the link between visual features and predictions. Third, we propose a sound-based \\\"attention path\\\" which uses the benefit of complementary audio cues to identify important visual regions. We use contrastive learning to perform region-based audio-visual instance discrimination, which serves as an intermediate task and benefits from the complementary cues from audio to boost object classification and detection performance. We show that our methods, which update noisy ground truth and provide indirect and attention paths, greatly boosting performance on the AudioSet and VGGSound datasets compared to single-modality predictions, even ones that use contrastive learning. Our method outperforms previous weakly-supervised detectors for the task of object detection by reaching the state-of-art on AudioSet, and our sound localization module performs better than several state-of-art methods on AudioSet and MUSIC.\",\"PeriodicalId\":270631,\"journal\":{\"name\":\"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)\",\"volume\":\"51 6\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WACV56688.2023.00222\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WACV56688.2023.00222","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

我们解决了在噪声环境中学习目标检测器的问题，这是弱监督学习的重大挑战之一。我们使用多模态学习来帮助定位感兴趣的对象，但与其他方法不同的是，我们将音频视为辅助模态，有助于处理视觉区域检测中的噪声。首先，我们使用视听模型为训练集生成新的“ground-truth”标签，以去除视觉特征和噪声监督之间的噪声。其次，我们提出了音频和类预测之间的“间接路径”，它结合了视觉和音频区域之间的联系，以及视觉特征和预测之间的联系。第三，我们提出了一种基于声音的“注意路径”，它利用互补的音频线索来识别重要的视觉区域。我们使用对比学习来执行基于区域的视听实例识别，它作为一个中间任务，利用来自音频的互补线索来提高目标分类和检测性能。我们表明，与单模态预测相比，我们的方法(更新嘈杂的地面真值并提供间接和注意力路径)大大提高了AudioSet和VGGSound数据集的性能，即使是使用对比学习的方法。通过在AudioSet上达到最新水平，我们的方法在对象检测任务上优于以前的弱监督检测器，并且我们的声音定位模块在AudioSet和MUSIC上的性能优于几种最新方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection

We tackle the problem of learning object detectors in a noisy environment, which is one of the significant challenges for weakly-supervised learning. We use multimodal learning to help localize objects of interest, but unlike other methods, we treat audio as an auxiliary modality that assists to tackle noise in detection from visual regions. First, we use the audio-visual model to generate new "ground-truth" labels for the training set to remove noise between the visual features and noisy supervision. Second, we propose an "indirect path" between audio and class predictions, which combines the link between visual and audio regions, and the link between visual features and predictions. Third, we propose a sound-based "attention path" which uses the benefit of complementary audio cues to identify important visual regions. We use contrastive learning to perform region-based audio-visual instance discrimination, which serves as an intermediate task and benefits from the complementary cues from audio to boost object classification and detection performance. We show that our methods, which update noisy ground truth and provide indirect and attention paths, greatly boosting performance on the AudioSet and VGGSound datasets compared to single-modality predictions, even ones that use contrastive learning. Our method outperforms previous weakly-supervised detectors for the task of object detection by reaching the state-of-art on AudioSet, and our sound localization module performs better than several state-of-art methods on AudioSet and MUSIC.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

自引率

0.00%

发文量