BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features

Jing Luo, Xinyu Yang, Dorien Herremans
DOI: arXiv-2407.10462
Journal: arXiv - CS - Sound
Published: 2024-07-15 (Journal Article)
Citations: 0

Abstract

Controllable music generation promotes the interaction between humans and composition systems by projecting the users' intent on their desired music. The challenge of introducing controllability is an increasingly important issue in the symbolic music generation field. When building controllable generative popular multi-instrument music systems, two main challenges typically present themselves, namely weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls to enhance the controllability of the generative model. In addition, an efficient music representation called REMI_Track is designed to convert multitrack music into multiple parallel music sequences and shorten the sequence length of each track with Byte Pair Encoding (BPE) techniques. Subsequently, we release BandControlNet, a conditional model based on parallel Transformers, to tackle the multiple music sequences and generate high-quality music samples that are conditioned to the given spatiotemporal control features. More concretely, the two specially designed modules of BandControlNet, namely structure-enhanced self-attention (SE-SA) and Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical structure and inter-track harmony modeling respectively. Experimental results tested on two popular music datasets of different lengths demonstrate that the proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed and shows great robustness in generating long music samples. The subjective evaluations show BandControlNet trained on short datasets can generate music with comparable quality to state-of-the-art models, while outperforming them significantly using longer datasets.
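The abstract notes that REMI_Track shortens each track's token sequence with Byte Pair Encoding (BPE). As a rough illustration of that idea only (not the paper's actual implementation or vocabulary), the sketch below repeatedly merges the most frequent adjacent pair of tokens in a single track's event sequence; the token names are hypothetical REMI-style events:

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent token pair, or None if the sequence is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_shorten(seq, num_merges):
    """Iteratively merge the most frequent pair, shortening the sequence each round."""
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        # Use the pair tuple itself as the new merged token.
        seq = merge_pair(seq, pair, pair)
    return seq

# Hypothetical single-track event sequence (illustrative token names).
track = ["Pos_0", "Pitch_60", "Dur_4", "Pos_4", "Pitch_60", "Dur_4",
         "Pos_8", "Pitch_62", "Dur_4", "Pos_12", "Pitch_60", "Dur_4"]
short = bpe_shorten(track, num_merges=2)
```

Because recurring note patterns collapse into single merged tokens, each per-track sequence becomes shorter, which is what makes the parallel per-track representation cheaper for the Transformer to model.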