VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

arXiv - CS - Sound | Pub Date: 2024-06-06 | DOI: arxiv-2406.04321

Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

Citations: 0

Abstract

In this work, we systematically study music generation conditioned solely on video. First, we present a large-scale dataset comprising 190K video-music pairs, covering various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues through Long-Short-Term modeling, VidMuse creates musically coherent audio tracks that consistently match the video content. Extensive experiments show that VidMuse outperforms existing models in audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
