Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu
{"title":"Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models","authors":"Tianze Xu, Jiajun Li, Xuesong Chen, Yinrui Yao, Shuchang Liu","doi":"arxiv-2405.02801","DOIUrl":null,"url":null,"abstract":"In recent years, AI-Generated Content (AIGC) has witnessed rapid\nadvancements, facilitating the generation of music, images, and other forms of\nartistic expression across various industries. However, researches on general\nmulti-modal music generation model remain scarce. To fill this gap, we propose\na multi-modal music generation framework Mozart's Touch. It could generate\naligned music with the cross-modality inputs, such as images, videos and text.\nMozart's Touch is composed of three main components: Multi-modal Captioning\nModule, Large Language Model (LLM) Understanding & Bridging Module, and Music\nGeneration Module. Unlike traditional approaches, Mozart's Touch requires no\ntraining or fine-tuning pre-trained models, offering efficiency and\ntransparency through clear, interpretable prompts. We also introduce\n\"LLM-Bridge\" method to resolve the heterogeneous representation problems\nbetween descriptive texts of different modalities. We conduct a series of\nobjective and subjective evaluations on the proposed model, and results\nindicate that our model surpasses the performance of current state-of-the-art\nmodels. Our codes and examples is availble at:\nhttps://github.com/WangTooNaive/MozartsTouch","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.02801","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
In recent years, AI-Generated Content (AIGC) has witnessed rapid advancements, facilitating the generation of music, images, and other forms of artistic expression across various industries. However, research on general multi-modal music generation models remains scarce. To fill this gap, we propose a multi-modal music generation framework, Mozart's Touch. It can generate music aligned with cross-modal inputs such as images, videos, and text. Mozart's Touch is composed of three main components: a Multi-modal Captioning Module, a Large Language Model (LLM) Understanding & Bridging Module, and a Music Generation Module. Unlike traditional approaches, Mozart's Touch requires no training or fine-tuning of pre-trained models, offering efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous-representation problem between descriptive texts of different modalities. We conduct a series of objective and subjective evaluations on the proposed model, and the results indicate that it surpasses current state-of-the-art models. Our code and examples are available at:
https://github.com/WangTooNaive/MozartsTouch
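
The abstract describes a three-stage pipeline built entirely from pre-trained models. The sketch below illustrates one plausible way such a pipeline could be wired together; the specific model choices (BLIP for captioning, MusicGen for generation) and the `llm_bridge` prompt are illustrative assumptions, not the authors' actual implementation, which is available in the linked repository.

```python
# Minimal sketch of the three-stage pipeline described in the abstract.
# Model choices (BLIP, MusicGen) and the bridging prompt are assumptions.
from transformers import pipeline, AutoProcessor, MusicgenForConditionalGeneration

# 1. Multi-modal Captioning Module: describe the visual input in natural language.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# 2. LLM Understanding & Bridging Module ("LLM-Bridge"): rewrite the visual
#    caption as a music-oriented prompt. Stubbed here; any chat LLM would do.
def llm_bridge(caption: str) -> str:
    # In practice this would call an LLM with an instruction such as
    # "Describe music that would suit this scene: {caption}".
    return f"instrumental music evoking the mood of: {caption}"

# 3. Music Generation Module: text-to-music with a pre-trained model.
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
musicgen = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

def image_to_music(image_path: str):
    caption = captioner(image_path)[0]["generated_text"]
    music_prompt = llm_bridge(caption)
    inputs = processor(text=[music_prompt], padding=True, return_tensors="pt")
    audio = musicgen.generate(**inputs, max_new_tokens=256)  # roughly 5 s of audio
    return audio  # tensor of shape (batch, channels, samples)
```

Because every stage is an off-the-shelf pre-trained model glued together by prompts, no training or fine-tuning is needed, which is the efficiency and transparency argument the abstract makes.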