Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
{"title":"Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models","authors":"Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu","doi":"arxiv-2405.02801","DOIUrl":null,"url":null,"abstract":"In recent years, AI-Generated Content (AIGC) has witnessed rapid\nadvancements, facilitating the generation of music, images, and other forms of\nartistic expression across various industries. However, researches on general\nmulti-modal music generation model remain scarce. To fill this gap, we propose\na multi-modal music generation framework Mozart's Touch. It could generate\naligned music with the cross-modality inputs, such as images, videos and text.\nMozart's Touch is composed of three main components: Multi-modal Captioning\nModule, Large Language Model (LLM) Understanding & Bridging Module, and Music\nGeneration Module. Unlike traditional approaches, Mozart's Touch requires no\ntraining or fine-tuning pre-trained models, offering efficiency and\ntransparency through clear, interpretable prompts. We also introduce\n\"LLM-Bridge\" method to resolve the heterogeneous representation problems\nbetween descriptive texts of different modalities. We conduct a series of\nobjective and subjective evaluations on the proposed model, and results\nindicate that our model surpasses the performance of current state-of-the-art\nmodels. Our codes and examples is availble at:\nhttps://github.com/WangTooNaive/MozartsTouch","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.02801","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous-representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that it surpasses current state-of-the-art models. Our code and examples are available at:
https://github.com/WangTooNaive/MozartsTouch
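
The abstract describes a three-stage pipeline built entirely from pre-trained models. The sketch below illustrates one plausible way such a pipeline could be wired together; the specific model choices (BLIP for captioning, MusicGen for generation) and the `llm_bridge` prompt are illustrative assumptions, not the authors' actual implementation, which is available in the linked repository.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# Model choices (BLIP, MusicGen) and the bridging prompt are assumptions.
from transformers import pipeline, AutoProcessor, MusicgenForConditionalGeneration

# 1. Multi-modal Captioning Module: describe the visual input in natural language.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# 2. LLM Understanding & Bridging Module ("LLM-Bridge"): rewrite the visual
#    caption as a music-oriented prompt. Stubbed here; any chat LLM would do.
def llm_bridge(caption: str) -> str:
    # In practice this would call an LLM with an instruction such as
    # "Describe music that would suit this scene: {caption}".
    return f"instrumental music evoking the mood of: {caption}"

# 3. Music Generation Module: text-to-music with a pre-trained model.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

def image_to_music(image_path: str):
    caption = captioner(image_path)[0]["generated_text"]
    music_prompt = llm_bridge(caption)
    inputs = processor(text=[music_prompt], padding=True, return_tensors="pt")
    audio = musicgen.generate(**inputs, max_new_tokens=256)  # roughly 5 s of audio
    return audio  # tensor of shape (batch, channels, samples)
```

Because every stage is an off-the-shelf pre-trained model glued together by prompts, no training or fine-tuning is needed, which is the efficiency and transparency argument the abstract makes.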