Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models

Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
arXiv - CS - Sound · DOI: arxiv-2405.02801 (https://doi.org/arxiv-2405.02801) · Published: 2024-05-05
Citations: 0

Abstract

In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework that generates music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations of the proposed model, and the results indicate that it surpasses current state-of-the-art models. Our code and examples are available at: https://github.com/WangTooNaive/MozartsTouch
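The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the paper does not specify these function names, and the actual captioning, LLM, and music-generation model calls are replaced here with placeholder stubs.

```python
# Hypothetical sketch of the three-stage Mozart's Touch pipeline.
# Real implementations would call a captioning model, an LLM, and a
# text-to-music model; the stubs below only show the data flow.

def caption_input(media: str) -> str:
    """Multi-modal Captioning Module: describe the input media in text."""
    # Placeholder for an image/video captioning model.
    return f"a scene showing {media}"

def llm_bridge(caption: str) -> str:
    """LLM Understanding & Bridging Module: convert a visual description
    into a music-oriented prompt (the 'LLM-Bridge' step)."""
    # Placeholder for an LLM call; here a fixed prompt template.
    return f"calm instrumental music evoking {caption}"

def generate_music(prompt: str) -> str:
    """Music Generation Module: text-to-music generation."""
    # Placeholder for a text-to-music model.
    return f"<audio generated from prompt: {prompt}>"

def mozarts_touch(media: str) -> str:
    """Chain the three modules: caption -> bridge -> generate."""
    return generate_music(llm_bridge(caption_input(media)))

print(mozarts_touch("a sunset over the ocean"))
```

Because the pre-trained models are only prompted, never fine-tuned, swapping any stage for a different model changes no other code, which is the source of the framework's claimed efficiency and transparency.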