Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2019-05-12. DOI: 10.1109/ICASSP.2019.8682475

J. Cramer, Ho-Hsiang Wu, J. Salamon, J. Bello


Citations: 223

Abstract

A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings’ behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.
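The workflow the abstract describes, using a pretrained embedding to train a shallow classifier on a small labeled dataset, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `fake_clip` helper is hypothetical and produces random vectors standing in for frame-level embeddings from a pretrained model, the class names are invented, and the nearest-centroid classifier is a stand-in chosen for brevity (the abstract does not say which shallow classifier the paper uses).

```python
import random
from statistics import fmean


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    return [fmean(dim) for dim in zip(*vectors)]


def fit_centroids(clip_vectors, labels):
    """Compute one centroid per class from clip-level embedding vectors."""
    by_label = {}
    for vec, label in zip(clip_vectors, labels):
        by_label.setdefault(label, []).append(vec)
    return {label: mean_pool(vecs) for label, vecs in by_label.items()}


def predict(centroids, vec):
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def fake_clip(rng, offset, n_frames=10, dim=8):
    """Hypothetical stand-in for a pretrained embedding model's output:
    n_frames random vectors of size dim, centered on a class-specific offset."""
    return [[rng.gauss(offset, 0.5) for _ in range(dim)] for _ in range(n_frames)]


rng = random.Random(0)

# A small labeled dataset: five clips per (invented) class.
train = [(fake_clip(rng, 0.0), "siren") for _ in range(5)]
train += [(fake_clip(rng, 3.0), "dog_bark") for _ in range(5)]

# Pool frame-level embeddings into one vector per clip, then fit the classifier.
train_clips = [mean_pool(frames) for frames, _ in train]
train_labels = [label for _, label in train]
centroids = fit_centroids(train_clips, train_labels)

# Classify an unseen clip drawn from the second class.
test_vec = mean_pool(fake_clip(rng, 3.0))
prediction = predict(centroids, test_vec)
print(prediction)
```

In a real setting the random vectors would be replaced by embeddings extracted from audio, and the classifier by whatever shallow model suits the downstream task; only the classifier is trained on the labeled data, while the embedding stays fixed.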

