JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE

International Society for Music Information Retrieval Conference | Pub Date: 2022-10-12 | DOI: 10.48550/arXiv.2210.06007

Yueh-Kao Wu, Ching-Yu Chiu, Yi-Hsuan Yang


Citations: 3

Abstract

This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quantized variational autoencoder (VQ-VAE) to represent the input audio with discrete codes, which can then be readily used in a Transformer. Second, using an audio-domain beat tracking model, we compute beat-related features of the input audio and use them as embeddings in the Transformer. Instead of generating the drum track directly as waveforms, we use a separate VQ-VAE to encode the mel-spectrogram of a drum track into another set of discrete codes, and train the Transformer to predict the sequence of drum-related discrete codes. The output codes are then converted to a mel-spectrogram with a decoder, and then to the waveform with a vocoder. We report both objective and subjective evaluations of variants of the proposed model, demonstrating that the model with beat information generates drum accompaniment that is rhythmically and stylistically consistent with the input audio.
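
The discrete-code step described above can be illustrated with a minimal sketch. It shows only the nearest-neighbour codebook lookup that a VQ-VAE uses to turn continuous encoder frames into token indices, plus one possible way to attach beat features to the token embeddings. The function names, the toy 2-dimensional codebook, and the choice to sum token and beat embeddings are all assumptions made for illustration; the abstract does not state how the beat embeddings are combined, and this is not the authors' implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each encoder frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(codes: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each code (the input a decoder would see)."""
    return [list(codebook[k]) for k in codes]


def add_beat_embedding(
    token_embs: Sequence[Vector], beat_embs: Sequence[Vector]
) -> List[List[float]]:
    """Sum token and beat embeddings per time step (an assumed combination rule)."""
    if len(token_embs) != len(beat_embs):
        raise ValueError("token and beat sequences must have the same length")
    return [[t + b for t, b in zip(tok, beat)] for tok, beat in zip(token_embs, beat_embs)]


# Hypothetical toy data: three codebook entries, four encoder frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [0.6, 0.6]]

codes = quantize(frames, codebook)          # discrete tokens for the Transformer
reconstructed = dequantize(codes, codebook)  # quantized frames for a decoder
```

In the pipeline the abstract describes, a sequence like `codes` would be what the Transformer consumes for the drumless input and predicts for the drum track; the predicted drum codes would then pass through a decoder to a mel-spectrogram and through a vocoder to a waveform, neither of which is modeled here.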
