Supervised and Unsupervised Learning of Audio Representations for Music Understanding

Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, F. Gouyon, Andreas F. Ehmann
{"title":"用于音乐理解的音频表示的监督和非监督学习","authors":"Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, F. Gouyon, Andreas F. Ehmann","doi":"10.48550/arXiv.2210.03799","DOIUrl":null,"url":null,"abstract":"In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affects the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done in an efficient manner with models containing less than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art in unsupervised learning -- and in some cases, supervised learning -- for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.","PeriodicalId":309903,"journal":{"name":"International Society for Music Information Retrieval Conference","volume":"87 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Supervised and Unsupervised Learning of Audio Representations for Music Understanding\",\"authors\":\"Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, F. Gouyon, Andreas F. Ehmann\",\"doi\":\"10.48550/arXiv.2210.03799\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affects the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done in an efficient manner with models containing less than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of representations learned by the model. 
We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art in unsupervised learning -- and in some cases, supervised learning -- for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.\",\"PeriodicalId\":309903,\"journal\":{\"name\":\"International Society for Music Information Retrieval Conference\",\"volume\":\"87 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Society for Music Information Retrieval Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2210.03799\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Society for Music Information Retrieval Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.03799","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15

Abstract

In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of the pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks. We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs. Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning -- and in some cases, supervised learning -- for music understanding. We also corroborate that, while achieving state-of-the-art performance on many tasks, supervised learning can cause models to specialize to the supervised information provided, somewhat compromising a model's generality.
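
To illustrate the fine-tuning-free evaluation paradigm the abstract describes -- a frozen pre-trained encoder producing fixed embeddings, with only a shallow probe trained per downstream labelling task -- here is a minimal sketch in Python. This is not the authors' code: the `load_embeddings` helper, the 512-dimensional embeddings, and the synthetic genre labels are hypothetical placeholders standing in for a pre-trained audio encoder and an expert-annotated catalog.

```python
# A minimal sketch (not the paper's implementation) of evaluating frozen
# audio embeddings with a shallow probe, as in the abstract's
# "no fine-tuning or reparameterization" setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score


def load_embeddings(split: str) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical stand-in: return (embeddings, binary labels) for a split.

    In practice the embeddings would come from a frozen pre-trained encoder
    applied once to every track in the catalog.
    """
    rng = np.random.default_rng(0)
    n = 1000 if split == "train" else 200
    x = rng.normal(size=(n, 512)).astype(np.float32)  # e.g. 512-d embeddings
    y = (x[:, 0] > 0).astype(int)                     # synthetic genre label
    return x, y


# Train a linear probe on the frozen embeddings; only the probe's weights
# are learned, so the encoder itself is never touched.
x_train, y_train = load_embeddings("train")
x_test, y_test = load_embeddings("test")

probe = LogisticRegression(max_iter=1000)
probe.fit(x_train, y_train)

scores = probe.predict_proba(x_test)[:, 1]
print(f"test AP: {average_precision_score(y_test, scores):.3f}")
```

Because only the probe is trained, supporting a task with novel content or vocabulary amounts to fitting a new probe over embeddings computed once per track, which is what makes the approach tractable for industry-scale audio catalogs.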