Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
{"title":"莫扎特的触摸基于预训练大型模型的轻量级多模态音乐生成框架","authors":"Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu","doi":"arxiv-2405.02801","DOIUrl":null,"url":null,"abstract":"In recent years, AI-Generated Content (AIGC) has witnessed rapid\nadvancements, facilitating the generation of music, images, and other forms of\nartistic expression across various industries. However, researches on general\nmulti-modal music generation model remain scarce. To fill this gap, we propose\na multi-modal music generation framework Mozart's Touch. It could generate\naligned music with the cross-modality inputs, such as images, videos and text.\nMozart's Touch is composed of three main components: Multi-modal Captioning\nModule, Large Language Model (LLM) Understanding & Bridging Module, and Music\nGeneration Module. Unlike traditional approaches, Mozart's Touch requires no\ntraining or fine-tuning pre-trained models, offering efficiency and\ntransparency through clear, interpretable prompts. We also introduce\n\"LLM-Bridge\" method to resolve the heterogeneous representation problems\nbetween descriptive texts of different modalities. We conduct a series of\nobjective and subjective evaluations on the proposed model, and results\nindicate that our model surpasses the performance of current state-of-the-art\nmodels. Our codes and examples is availble at:\nhttps://github.com/WangTooNaive/MozartsTouch","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models\",\"authors\":\"Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu\",\"doi\":\"arxiv-2405.02801\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, AI-Generated Content (AIGC) has witnessed rapid\\nadvancements, facilitating the generation of music, images, and other forms of\\nartistic expression across various industries. However, researches on general\\nmulti-modal music generation model remain scarce. To fill this gap, we propose\\na multi-modal music generation framework Mozart's Touch. It could generate\\naligned music with the cross-modality inputs, such as images, videos and text.\\nMozart's Touch is composed of three main components: Multi-modal Captioning\\nModule, Large Language Model (LLM) Understanding & Bridging Module, and Music\\nGeneration Module. Unlike traditional approaches, Mozart's Touch requires no\\ntraining or fine-tuning pre-trained models, offering efficiency and\\ntransparency through clear, interpretable prompts. We also introduce\\n\\\"LLM-Bridge\\\" method to resolve the heterogeneous representation problems\\nbetween descriptive texts of different modalities. We conduct a series of\\nobjective and subjective evaluations on the proposed model, and results\\nindicate that our model surpasses the performance of current state-of-the-art\\nmodels. 
Our codes and examples is availble at:\\nhttps://github.com/WangTooNaive/MozartsTouch\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.02801\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.02801","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations of the proposed model, and the results indicate that it surpasses the performance of current state-of-the-art models. Our code and examples are available at:
https://github.com/WangTooNaive/MozartsTouch
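
The abstract outlines a three-stage pipeline: a captioning model describes the visual input, an LLM bridges that description into a music-oriented prompt, and a text-to-music model renders the audio. The sketch below is only a minimal illustration of that flow; the helper names (caption_input, llm_bridge, generate_music) and their stubbed outputs are assumptions for illustration, not the repository's actual API, and the concrete pre-trained models are documented in the linked repository.

```python
# Minimal sketch of the three-stage flow described in the abstract.
# All function names, prompts, and return values are illustrative
# placeholders, not taken from the Mozart's Touch repository.

def caption_input(media_path: str) -> str:
    """Multi-modal Captioning Module: describe an image or video frame
    with a pre-trained captioning model (stubbed here)."""
    # A real implementation would run a pre-trained vision-language captioner.
    return "a quiet street at dusk, warm light, people walking slowly"

def llm_bridge(caption: str) -> str:
    """LLM Understanding & Bridging Module: ask a pre-trained LLM to turn
    the visual description into a music-oriented description, bridging the
    heterogeneous representations of the two modalities (stubbed here)."""
    prompt = (
        "Rewrite the following scene description as a prompt for a "
        f"text-to-music model: {caption}"
    )
    # A real implementation would send `prompt` to an LLM and return its reply.
    return "calm lo-fi piece, mellow piano, slow tempo, warm ambience"

def generate_music(music_prompt: str) -> bytes:
    """Music Generation Module: feed the bridged prompt to a pre-trained
    text-to-music model and return the rendered audio (stubbed here)."""
    # A real implementation would call a pre-trained text-to-music model.
    return b""

if __name__ == "__main__":
    caption = caption_input("example.jpg")
    music_prompt = llm_bridge(caption)
    audio = generate_music(music_prompt)
    print(f"caption: {caption}\nbridged prompt: {music_prompt}\naudio bytes: {len(audio)}")
```

Because each stage exchanges only plain, interpretable text, the pipeline can remain training-free: any pre-trained captioner, LLM, or text-to-music generator could in principle be swapped in without fine-tuning, which is the efficiency and transparency claim made in the abstract.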