基于CRNN-HSMM混合模型的音频到乐谱唱歌转录

IF 3.2 Q1 Computer Science

APSIPA Transactions on Signal and Information Processing Pub Date : 2021-04-20 DOI:10.1017/ATSIP.2021.4

Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii

{"title":"基于CRNN-HSMM混合模型的音频到乐谱唱歌转录","authors":"Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii","doi":"10.1017/ATSIP.2021.4","DOIUrl":null,"url":null,"abstract":"This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.","PeriodicalId":44812,"journal":{"name":"APSIPA Transactions on Signal and Information Processing","volume":" ","pages":""},"PeriodicalIF":3.2000,"publicationDate":"2021-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1017/ATSIP.2021.4","citationCount":"10","resultStr":"{\"title\":\"Audio-to-score singing transcription based on a CRNN-HSMM hybrid model\",\"authors\":\"Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii\",\"doi\":\"10.1017/ATSIP.2021.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.\",\"PeriodicalId\":44812,\"journal\":{\"name\":\"APSIPA Transactions on Signal and Information Processing\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2021-04-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1017/ATSIP.2021.4\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"APSIPA Transactions on Signal and Information Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1017/ATSIP.2021.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"APSIPA Transactions on Signal and Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1017/ATSIP.2021.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}

引用次数: 10

摘要

本文描述了一种自动歌唱转录（AST）方法，该方法根据输入音乐信号估计歌唱旋律的人类可读乐谱。由于歌声的音高和时间变化很大，估计F0轮廓并用估计的状态时间对其进行量化的天真级联方法无法避免许多音高和节奏误差。为了解决这个问题，我们建立了一个统一的音乐信号生成模型，该模型由表示以音乐键为条件的潜在音符生成过程的半马尔可夫语言模型和基于卷积递归神经网络（CRNN）的声学模型组成，该模型表示从音符中观察到的音乐信号的生成过程。由此产生的CRNN-HSMM混合模型使我们能够使用维特比算法从音乐信号中估计最可能的音符，同时利用关于音符的语法知识和CRNN的表达能力。实验结果表明，所提出的方法优于传统的最先进的方法，音乐语言模型与声学模型的集成对AST性能有积极影响。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

APSIPA Transactions on Signal and Information Processing ENGINEERING, ELECTRICAL & ELECTRONIC-

CiteScore

8.60

自引率

6.20%

发文量

审稿时长

40 weeks