Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

Luis Carvalho, Tobias Washüttl, Gerhard Widmer
{"title":"Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems","authors":"Luis Carvalho, Tobias Washüttl, G. Widmer","doi":"10.1145/3587819.3590968","DOIUrl":null,"url":null,"abstract":"Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pretrained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the annotated data scarcity in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.","PeriodicalId":330983,"journal":{"name":"Proceedings of the 14th Conference on ACM Multimedia Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 14th Conference on ACM Multimedia Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3587819.3590968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches to this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data and contrast randomly augmented views of snippets from both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We conclude by arguing for the potential of self-supervised contrastive learning to alleviate annotated-data scarcity in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.
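
To make the pre-training objective concrete, below is a minimal sketch of a cross-modal InfoNCE (NT-Xent-style) contrastive loss of the kind the abstract describes, written in PyTorch. All names (`cross_modal_info_nce`, `audio_emb`, `sheet_emb`, `temperature`) are illustrative assumptions rather than the authors' implementation; the actual training code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(audio_emb: torch.Tensor,
                         sheet_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired snippet embeddings.

    audio_emb, sheet_emb: (batch, dim) tensors; row i of each tensor is
    assumed to come from the same underlying musical passage.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    s = F.normalize(sheet_emb, dim=-1)
    # logits[i, j] = similarity between audio snippet i and sheet snippet j.
    logits = a @ s.t() / temperature
    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions: audio -> sheet and sheet -> audio.
    loss_a2s = F.cross_entropy(logits, targets)
    loss_s2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2s + loss_s2a)
```

Normalizing both embeddings turns the logits into cosine similarities, so the temperature directly controls how sharply the softmax concentrates probability mass on the matching pair relative to the in-batch negatives.

The higher-level piece-identification task can likewise be sketched as aggregating snippet-level matches. The majority-vote scheme below (reusing the imports above) is a hypothetical illustration, not the retrieval procedure evaluated in the paper:

```python
def identify_piece(query_emb: torch.Tensor,
                   db_emb: torch.Tensor,
                   db_piece_ids: torch.Tensor,
                   k: int = 5) -> int:
    """Return the piece id receiving the most snippet-level votes.

    query_emb:    (n_query, dim) L2-normalized query snippet embeddings.
    db_emb:       (n_db, dim) L2-normalized database snippet embeddings.
    db_piece_ids: (n_db,) non-negative integer piece label per db snippet.
    """
    sims = query_emb @ db_emb.t()             # cosine similarity matrix
    topk = sims.topk(k, dim=1).indices        # (n_query, k) nearest snippets
    votes = db_piece_ids[topk].flatten()      # piece labels of all matches
    return int(torch.bincount(votes).argmax())  # majority vote
```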