{"title":"BandControlNet:基于并行变换器的可转向流行音乐生成与细粒度时空特征","authors":"Jing Luo, Xinyu Yang, Dorien Herremans","doi":"arxiv-2407.10462","DOIUrl":null,"url":null,"abstract":"Controllable music generation promotes the interaction between humans and\ncomposition systems by projecting the users' intent on their desired music. The\nchallenge of introducing controllability is an increasingly important issue in\nthe symbolic music generation field. When building controllable generative\npopular multi-instrument music systems, two main challenges typically present\nthemselves, namely weak controllability and poor music quality. To address\nthese issues, we first propose spatiotemporal features as powerful and\nfine-grained controls to enhance the controllability of the generative model.\nIn addition, an efficient music representation called REMI_Track is designed to\nconvert multitrack music into multiple parallel music sequences and shorten the\nsequence length of each track with Byte Pair Encoding (BPE) techniques.\nSubsequently, we release BandControlNet, a conditional model based on parallel\nTransformers, to tackle the multiple music sequences and generate high-quality\nmusic samples that are conditioned to the given spatiotemporal control\nfeatures. More concretely, the two specially designed modules of\nBandControlNet, namely structure-enhanced self-attention (SE-SA) and\nCross-Track Transformer (CTT), are utilized to strengthen the resulting musical\nstructure and inter-track harmony modeling respectively. Experimental results\ntested on two popular music datasets of different lengths demonstrate that the\nproposed BandControlNet outperforms other conditional music generation models\non most objective metrics in terms of fidelity and inference speed and shows\ngreat robustness in generating long music samples. The subjective evaluations\nshow BandControlNet trained on short datasets can generate music with\ncomparable quality to state-of-the-art models, while outperforming them\nsignificantly using longer datasets.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features\",\"authors\":\"Jing Luo, Xinyu Yang, Dorien Herremans\",\"doi\":\"arxiv-2407.10462\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Controllable music generation promotes the interaction between humans and\\ncomposition systems by projecting the users' intent on their desired music. The\\nchallenge of introducing controllability is an increasingly important issue in\\nthe symbolic music generation field. When building controllable generative\\npopular multi-instrument music systems, two main challenges typically present\\nthemselves, namely weak controllability and poor music quality. 
To address\\nthese issues, we first propose spatiotemporal features as powerful and\\nfine-grained controls to enhance the controllability of the generative model.\\nIn addition, an efficient music representation called REMI_Track is designed to\\nconvert multitrack music into multiple parallel music sequences and shorten the\\nsequence length of each track with Byte Pair Encoding (BPE) techniques.\\nSubsequently, we release BandControlNet, a conditional model based on parallel\\nTransformers, to tackle the multiple music sequences and generate high-quality\\nmusic samples that are conditioned to the given spatiotemporal control\\nfeatures. More concretely, the two specially designed modules of\\nBandControlNet, namely structure-enhanced self-attention (SE-SA) and\\nCross-Track Transformer (CTT), are utilized to strengthen the resulting musical\\nstructure and inter-track harmony modeling respectively. Experimental results\\ntested on two popular music datasets of different lengths demonstrate that the\\nproposed BandControlNet outperforms other conditional music generation models\\non most objective metrics in terms of fidelity and inference speed and shows\\ngreat robustness in generating long music samples. The subjective evaluations\\nshow BandControlNet trained on short datasets can generate music with\\ncomparable quality to state-of-the-art models, while outperforming them\\nsignificantly using longer datasets.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.10462\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.10462","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features
Controllable music generation promotes the interaction between humans and composition systems by allowing users to project their intent onto the music they want to create. Introducing controllability is an increasingly important challenge in the field of symbolic music generation. When building controllable generative systems for popular multi-instrument music, two main challenges typically arise: weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls that enhance the controllability of the generative model. In addition, we design an efficient music representation called REMI_Track that converts multitrack music into multiple parallel music sequences and shortens the sequence length of each track with Byte Pair Encoding (BPE) techniques.
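To make the sequence-shortening step concrete, below is a minimal sketch of Byte Pair Encoding applied to one track's token sequence. The token names, merge count, and merging scheme are illustrative assumptions, not the paper's exact REMI_Track or BPE implementation.

# Minimal BPE sketch over a per-track symbolic token sequence (illustrative only).
from collections import Counter

def most_frequent_pair(seq):
    # Count adjacent token pairs and return the most frequent one (or None).
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    # Replace every occurrence of `pair` with a single merged token.
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def bpe_shorten(seq, num_merges=2):
    # Apply a few BPE merges to shorten one track's token sequence.
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        seq = merge_pair(seq, pair)
    return seq

# Hypothetical REMI-style tokens for a single track (event names are made up).
track = ["Pitch_60", "Dur_4", "Pitch_64", "Dur_4", "Pitch_60", "Dur_4", "Pitch_67", "Dur_8"]
print(bpe_shorten(track))

Frequently co-occurring event pairs (such as a pitch followed by its duration) collapse into single tokens, which is what shortens each track's sequence.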
Subsequently, we release BandControlNet, a conditional model based on parallel Transformers, to handle these parallel music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features. More concretely, two specially designed modules of BandControlNet, namely structure-enhanced self-attention (SE-SA) and the Cross-Track Transformer (CTT), are used to strengthen the modeling of musical structure and inter-track harmony, respectively.
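As an illustration of the general parallel-Transformer idea, the PyTorch sketch below encodes each track with its own self-attention layer and then lets positions attend across tracks. The layer sizes, single-layer depth, and tensor layout are assumptions for illustration; this does not reproduce the actual SE-SA or CTT designs.

# Illustrative sketch: per-track self-attention followed by cross-track attention.
import torch
import torch.nn as nn

class ParallelTrackSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_tracks=3):
        super().__init__()
        # One self-attention encoder layer per track (the "parallel" part).
        self.track_encoders = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_tracks)
        )
        # A shared attention layer applied across the track dimension.
        self.cross_track_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, n_tracks, seq_len, d_model) -- already-embedded track tokens.
        b, t, s, d = x.shape
        # Per-track temporal self-attention.
        per_track = torch.stack(
            [enc(x[:, i]) for i, enc in enumerate(self.track_encoders)], dim=1
        )
        # Cross-track attention: at each time step, tracks attend to one another.
        flat = per_track.permute(0, 2, 1, 3).reshape(b * s, t, d)
        mixed, _ = self.cross_track_attn(flat, flat, flat)
        return mixed.reshape(b, s, t, d).permute(0, 2, 1, 3)

# Toy usage with random embeddings standing in for encoded track tokens.
model = ParallelTrackSketch()
out = model(torch.randn(2, 3, 16, 64))  # -> shape (2, 3, 16, 64)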
Experiments on two popular music datasets of different lengths demonstrate that the proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed, and shows great robustness in generating long music samples. Subjective evaluations show that BandControlNet trained on short datasets generates music of quality comparable to state-of-the-art models, while significantly outperforming them when trained on longer datasets.