VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

arXiv - EE - Audio and Speech Processing Pub Date : 2024-09-11 DOI:arxiv-2409.07450

Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang



Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture the fine-grained visual cues needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior dataset used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
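The abstract mentions a joint autoregressive and contrastive learning objective but gives no formula. As a rough illustration only, the sketch below combines a next-token cross-entropy loss over music tokens with a symmetric InfoNCE loss over paired video and music embeddings. The function names, the `temperature` and `weight` parameters, and the choice of InfoNCE are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits: (T, V) scores over a hypothetical music-token vocabulary.
    targets: (T,) integer ids of the true next tokens.
    """
    logp = log_softmax(logits, axis=-1)
    return -logp[np.arange(len(targets)), targets].mean()


def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings, each (B, D).

    Row i of video_emb is assumed to belong with row i of music_emb;
    all other rows in the batch act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    sim = v @ m.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(sim.shape[0])
    video_to_music = -log_softmax(sim, axis=1)[idx, idx].mean()
    music_to_video = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(logits, targets, video_emb, music_emb, weight=1.0):
    # Assumed combination: a plain weighted sum of the two terms.
    return autoregressive_loss(logits, targets) + weight * contrastive_loss(
        video_emb, music_emb
    )
```

In this sketch the contrastive term is low when each video embedding is closest to its own music embedding and high when the pairs are mismatched, which is one common way to encourage generated music to agree with high-level video content.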


