Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

Luis Carvalho, Tobias Washüttl, Gerhard Widmer
{"title":"Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems","authors":"Luis Carvalho, Tobias Washüttl, G. Widmer","doi":"10.1145/3587819.3590968","DOIUrl":null,"url":null,"abstract":"Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pretrained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the annotated data scarcity in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.","PeriodicalId":330983,"journal":{"name":"Proceedings of the 14th Conference on ACM Multimedia Systems","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 14th Conference on ACM Multimedia Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3587819.3590968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches to this task is to learn, via deep neural networks, a cross-modal embedding space that connects short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content limits the ability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data and contrast randomly augmented views of snippets from both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct further experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We conclude by arguing for the potential of self-supervised contrastive learning to alleviate annotated-data scarcity in multi-modal music retrieval models. Code and trained models are accessible at https://github.com/luisfvc/ucasr.
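
To make the pre-training objective concrete, below is a minimal sketch of a cross-modal InfoNCE (NT-Xent-style) contrastive loss of the kind the abstract describes, written in PyTorch. All names (`cross_modal_info_nce`, `audio_emb`, `sheet_emb`, `temperature`) are illustrative assumptions rather than the authors' implementation; the actual training code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(audio_emb: torch.Tensor,
                         sheet_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired snippet embeddings.

    audio_emb, sheet_emb: (batch, dim) tensors; row i of each tensor is
    assumed to come from the same underlying musical passage.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    s = F.normalize(sheet_emb, dim=-1)
    # logits[i, j] = similarity between audio snippet i and sheet snippet j.
    logits = a @ s.t() / temperature
    # The matching pair for row i sits on the diagonal.
    targets = torch.arange(a.size(0), device=a.device)
    # Contrast in both directions: audio -> sheet and sheet -> audio.
    loss_a2s = F.cross_entropy(logits, targets)
    loss_s2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2s + loss_s2a)
```

Normalizing both embeddings turns the logits into cosine similarities, so the temperature directly controls how sharply the softmax concentrates probability mass on the matching pair relative to the in-batch negatives.

The higher-level piece-identification task can likewise be sketched as aggregating snippet-level matches. The majority-vote scheme below (reusing the imports above) is a hypothetical illustration, not the retrieval procedure evaluated in the paper:

```python
def identify_piece(query_emb: torch.Tensor,
                   db_emb: torch.Tensor,
                   db_piece_ids: torch.Tensor,
                   k: int = 5) -> int:
    """Return the piece id receiving the most snippet-level votes.

    query_emb:    (n_query, dim) L2-normalized query snippet embeddings.
    db_emb:       (n_db, dim) L2-normalized database snippet embeddings.
    db_piece_ids: (n_db,) non-negative integer piece label per db snippet.
    """
    sims = query_emb @ db_emb.t()             # cosine similarity matrix
    topk = sims.topk(k, dim=1).indices        # (n_query, k) nearest snippets
    votes = db_piece_ids[topk].flatten()      # piece labels of all matches
    return int(torch.bincount(votes).argmax())  # majority vote
```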