Dual Task Monophonic Singing Transcription

IF 1.6, Zone 4 (Engineering Technology), JCR Q3 (Acoustics)

Journal of the Audio Engineering Society. Pub Date: 2022-12-12. DOI: 10.17743/jaes.2022.0040

Markus Schwabe, Sebastian Murgul, M. Heizmann

Citations: 0

Abstract

Dual Task Monophonic Singing Transcription

Automatic music transcription with note-level output is a current task in the field of music information retrieval. Whereas piano transcription achieves very good results thanks to large available datasets, transcription of non-professional singing has rarely been investigated with deep learning approaches because note-level annotated datasets are lacking. In this work, two datasets of amateur singing recordings are created: one for training (the synthetic singing dataset) and one for evaluation (the SingReal dataset). The synthetic training dataset is generated by synthesizing a large number of vocal melodies from artificial songs. Because the evaluation should represent a realistic scenario, the SingReal dataset is created from real recordings of non-professional singers. To transcribe singing notes, a new method called Dual Task Monophonic Singing Transcription is proposed, which divides the problem of singing transcription into two subtasks, onset detection and pitch estimation, realized by two small independent neural networks. This approach achieves a note-level F1 score of 74.19% on the SingReal dataset, outperforming all investigated state-of-the-art transcription systems by at least 3.5%. Furthermore, Dual Task Monophonic Singing Transcription can be adapted very easily to real-time transcription.
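To make the two-subtask split concrete, here is a minimal sketch of how the outputs of an onset detector and a pitch estimator might be merged into note events. It is not the authors' implementation: the two neural networks are replaced by already-computed, hypothetical outputs (`onset_frames`, `frame_pitch`), and the merging rule (median pitch of the voiced frames between consecutive onsets), the `Note` type, the function name, and the 10 ms hop size are all assumptions made for illustration.

```python
from statistics import median
from typing import List, NamedTuple, Optional, Sequence


class Note(NamedTuple):
    onset_s: float   # note start time in seconds
    offset_s: float  # note end time in seconds
    midi_pitch: int  # rounded MIDI note number


def combine_onsets_and_pitch(
    onset_frames: Sequence[int],
    frame_pitch: Sequence[Optional[float]],
    hop_s: float = 0.01,
) -> List[Note]:
    """Merge onset frames and a frame-wise pitch track into note events.

    onset_frames: sorted frame indices where a note begins
        (hypothetical output of an onset detector).
    frame_pitch: one fractional MIDI pitch per frame, None when unvoiced
        (hypothetical output of a pitch estimator).
    hop_s: assumed time step between frames, in seconds.
    """
    notes: List[Note] = []
    # Each note spans from its onset to the next onset (or the end of the track).
    boundaries = list(onset_frames) + [len(frame_pitch)]
    for start, next_start in zip(boundaries[:-1], boundaries[1:]):
        # Collect the voiced frames that belong to this segment.
        voiced = [
            (i, p)
            for i, p in enumerate(frame_pitch[start:next_start], start)
            if p is not None
        ]
        if not voiced:
            continue  # an onset with no voiced frame yields no note
        last_voiced = voiced[-1][0]
        # Assumed rule: the note pitch is the median of its voiced frames.
        pitch = round(median(p for _, p in voiced))
        notes.append(Note(start * hop_s, (last_voiced + 1) * hop_s, pitch))
    return notes


# Toy input: two notes separated by one unvoiced frame.
toy_pitch = [None, None, 60.1, 59.9, 60.0, None, 62.2, 61.8, 62.0, 62.1]
for note in combine_onsets_and_pitch([2, 6], toy_pitch):
    print(note)
```

Because each segment only needs the frames up to the next onset, a rule of this kind can emit a note as soon as the following onset arrives, which is one plausible reading of why such a split suits real-time use; the abstract itself does not describe the mechanism.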

Source journal

Journal of the Audio Engineering Society (Engineering Technology - Engineering: Multidisciplinary)

CiteScore

3.50

Self-citation rate

14.30%

Articles published

Review time

1 month

About the journal: The Journal of the Audio Engineering Society, the official publication of the AES, is the only peer-reviewed journal devoted exclusively to audio technology. Published 10 times each year, it is available to all AES members and subscribers. The Journal contains state-of-the-art technical papers and engineering reports; feature articles covering timely topics; pre- and post-reports of AES conventions and other society activities; news from AES sections around the world; Standards and Education Committee work; membership news, patents, new products, and newsworthy developments in the field of audio.