学习复调音频与乐谱对齐的最佳功能

IEEE Transactions on Audio Speech and Language Processing Pub Date : 2013-10-01 DOI:10.1109/TASL.2013.2266794

C. Joder, S. Essid, G. Richard

{"title":"学习复调音频与乐谱对齐的最佳功能","authors":"C. Joder, S. Essid, G. Richard","doi":"10.1109/TASL.2013.2266794","DOIUrl":null,"url":null,"abstract":"This paper addresses the design of feature functions for the matching of a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions are compared in this alignment task. Our experiments elect the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2118-2128"},"PeriodicalIF":0.0000,"publicationDate":"2013-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2266794","citationCount":"30","resultStr":"{\"title\":\"Learning Optimal Features for Polyphonic Audio-to-Score Alignment\",\"authors\":\"C. Joder, S. Essid, G. Richard\",\"doi\":\"10.1109/TASL.2013.2266794\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper addresses the design of feature functions for the matching of a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions are compared in this alignment task. Our experiments elect the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.\",\"PeriodicalId\":55014,\"journal\":{\"name\":\"IEEE Transactions on Audio Speech and Language Processing\",\"volume\":\"21 1\",\"pages\":\"2118-2128\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/TASL.2013.2266794\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Audio Speech and Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TASL.2013.2266794\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2013.2266794","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 30

摘要

本文讨论了特征函数的设计，用于将音乐录音与作品(乐谱)的符号表示相匹配。这些特征函数被定义为音频观察和对应于分数的模板向量之间的不相似性度量。通过将模板构造表示为从符号到音频表示的线性映射，可以通过优化线性变换来学习特征函数。本文探讨了两种不同的学习策略。第一个使用最佳拟合标准(最小散度)，而第二个利用基于条件随机场模型的判别框架(最大似然标准)。我们在一个大型的流行和古典复调音乐数据库中评估了特征函数在音频-乐谱对齐任务中的影响。结果表明，对于几种类型的模型，使用不同的时间约束，学习映射具有优于经典启发式映射的潜力。在这个校准任务中，比较了音频观测的几种表示以及几种距离函数。我们的实验选择对称的Kullback-Leibler散度。此外，光谱图和基于cqt的表示都提供了非常精确的对准，在我们最复杂的系统中，检测超过97%的发作，精度为100毫秒。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning Optimal Features for Polyphonic Audio-to-Score Alignment

This paper addresses the design of feature functions for the matching of a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions are compared in this alignment task. Our experiments elect the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Audio Speech and Language Processing 工程技术-工程：电子与电气

自引率

0.00%

发文量

审稿时长

24.0 months

期刊介绍： The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.