Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
{"title":"VMAS:通过网络音乐视频中的语义对齐实现视频到音乐的生成","authors":"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang","doi":"arxiv-2409.07450","DOIUrl":null,"url":null,"abstract":"We present a framework for learning to generate background music from video\ninputs. Unlike existing works that rely on symbolic musical annotations, which\nare limited in quantity and diversity, our method leverages large-scale web\nvideos accompanied by background music. This enables our model to learn to\ngenerate realistic and diverse music. To accomplish this goal, we develop a\ngenerative video-music Transformer with a novel semantic video-music alignment\nscheme. Our model uses a joint autoregressive and contrastive learning\nobjective, which encourages the generation of music aligned with high-level\nvideo content. We also introduce a novel video-beat alignment scheme to match\nthe generated music beats with the low-level motions in the video. Lastly, to\ncapture fine-grained visual cues in a video needed for realistic background\nmusic generation, we introduce a new temporal video encoder architecture,\nallowing us to efficiently process videos consisting of many densely sampled\nframes. We train our framework on our newly curated DISCO-MV dataset,\nconsisting of 2.2M video-music samples, which is orders of magnitude larger\nthan any prior datasets used for video music generation. Our method outperforms\nexisting approaches on the DISCO-MV and MusicCaps datasets according to various\nmusic generation evaluation metrics, including human evaluation. Results are\navailable at https://genjib.github.io/project_page/VMAs/index.html","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos\",\"authors\":\"Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang\",\"doi\":\"arxiv-2409.07450\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a framework for learning to generate background music from video\\ninputs. Unlike existing works that rely on symbolic musical annotations, which\\nare limited in quantity and diversity, our method leverages large-scale web\\nvideos accompanied by background music. This enables our model to learn to\\ngenerate realistic and diverse music. To accomplish this goal, we develop a\\ngenerative video-music Transformer with a novel semantic video-music alignment\\nscheme. Our model uses a joint autoregressive and contrastive learning\\nobjective, which encourages the generation of music aligned with high-level\\nvideo content. We also introduce a novel video-beat alignment scheme to match\\nthe generated music beats with the low-level motions in the video. Lastly, to\\ncapture fine-grained visual cues in a video needed for realistic background\\nmusic generation, we introduce a new temporal video encoder architecture,\\nallowing us to efficiently process videos consisting of many densely sampled\\nframes. We train our framework on our newly curated DISCO-MV dataset,\\nconsisting of 2.2M video-music samples, which is orders of magnitude larger\\nthan any prior datasets used for video music generation. 
Our method outperforms\\nexisting approaches on the DISCO-MV and MusicCaps datasets according to various\\nmusic generation evaluation metrics, including human evaluation. Results are\\navailable at https://genjib.github.io/project_page/VMAs/index.html\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"37 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07450\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07450","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
We present a framework for learning to generate background music from video
inputs. Unlike existing works that rely on symbolic musical annotations, which
are limited in quantity and diversity, our method leverages large-scale web
videos accompanied by background music. This enables our model to learn to
generate realistic and diverse music. To accomplish this goal, we develop a
generative video-music Transformer with a novel semantic video-music alignment
scheme. Our model uses a joint autoregressive and contrastive learning
objective, which encourages the generation of music aligned with high-level
video content. We also introduce a novel video-beat alignment scheme to match
the generated music beats with the low-level motions in the video. Lastly, to
capture the fine-grained visual cues in a video that are needed for realistic background
music generation, we introduce a new temporal video encoder architecture,
allowing us to efficiently process videos consisting of many densely sampled
frames. We train our framework on our newly curated DISCO-MV dataset,
consisting of 2.2M video-music samples, which is orders of magnitude larger
than any prior dataset used for video music generation. Our method outperforms
existing approaches on the DISCO-MV and MusicCaps datasets according to various
music generation evaluation metrics, including human evaluation. Results are
available at https://genjib.github.io/project_page/VMAs/index.html
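To make the joint autoregressive and contrastive objective described in the abstract concrete, here is a minimal sketch of how a next-token music loss could be combined with a video-music contrastive term. The function and tensor names (`joint_loss`, `music_logits`, `video_emb`, `music_emb`), the InfoNCE formulation, and the temperature and weighting values are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a joint autoregressive + contrastive objective.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_loss(music_logits, target_tokens, video_emb, music_emb,
               tau: float = 0.07, lambda_contrastive: float = 1.0):
    """music_logits: (B, T, V) next-token predictions over a music codebook.
    target_tokens: (B, T) ground-truth music token ids.
    video_emb, music_emb: (B, D) pooled clip-level embeddings."""
    # Autoregressive term: standard next-token cross-entropy over music tokens.
    ar_loss = F.cross_entropy(
        music_logits.reshape(-1, music_logits.size(-1)),
        target_tokens.reshape(-1),
    )

    # Contrastive term: InfoNCE between video and music embeddings, so each
    # video is pulled toward its own music clip within the batch.
    v = F.normalize(video_emb, dim=-1)
    m = F.normalize(music_emb, dim=-1)
    logits = v @ m.t() / tau                      # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    contrastive_loss = 0.5 * (
        F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
    )

    return ar_loss + lambda_contrastive * contrastive_loss
```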
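The video-beat alignment scheme is only named in the abstract; one way to picture the underlying idea is to score how well music beat times coincide with peaks in low-level video motion. The sketch below is a hypothetical illustration under that reading: the motion signal, frame-rate handling, and scoring rule are assumptions, not the paper's method.

```python
# Hypothetical illustration of video-beat alignment scoring.
import numpy as np

def beat_motion_alignment(beat_times, frame_motion, fps):
    """beat_times: beat timestamps in seconds (1-D array-like).
    frame_motion: per-frame motion magnitude, e.g. mean absolute frame difference.
    fps: video frame rate."""
    motion = np.asarray(frame_motion, dtype=float)
    # Normalize motion to [0, 1] so the score is comparable across videos.
    motion = (motion - motion.min()) / (motion.max() - motion.min() + 1e-8)
    # Index of the frame nearest to each beat time.
    beat_frames = np.clip(
        np.round(np.asarray(beat_times) * fps).astype(int), 0, len(motion) - 1
    )
    # Higher average motion at beat positions = better low-level alignment.
    return float(motion[beat_frames].mean())
```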
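Finally, the abstract mentions a new temporal video encoder for efficiently processing many densely sampled frames, without giving details. The following is a minimal sketch of one generic way to do this: project per-frame features, pool along time to shrink the token sequence, then apply temporal self-attention. The per-frame backbone interface, pooling factor, and layer sizes are assumptions, not the architecture proposed in the paper.

```python
# Hypothetical sketch of a temporal encoder over densely sampled frame features.
import torch
import torch.nn as nn

class TemporalVideoEncoder(nn.Module):
    def __init__(self, frame_dim=768, model_dim=512, n_layers=4, n_heads=8, pool=4):
        super().__init__()
        self.proj = nn.Linear(frame_dim, model_dim)
        # Temporal pooling shrinks the frame sequence so self-attention over
        # many densely sampled frames stays affordable.
        self.pool = nn.AvgPool1d(kernel_size=pool, stride=pool)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=n_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats):
        # frame_feats: (B, T, frame_dim) features from a per-frame image backbone.
        x = self.proj(frame_feats)                         # (B, T, model_dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, T // pool, model_dim)
        return self.temporal(x)                            # contextualized video tokens
```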