Title: Video Echoed in Harmony: Learning and Sampling Video-Integrated Chord Progression Sequences for Controllable Video Background Music Generation
Authors: Xinyi Tong; Sitong Chen; Peiyang Yu; Nian Liu; Hui Qv; Tao Ma; Bo Zheng; Feng Yu; Song-Chun Zhu
DOI: 10.1109/TCSS.2024.3451515
Journal: IEEE Transactions on Computational Social Systems, vol. 12, no. 2, pp. 905-917
Impact Factor: 4.5
Publication Date: 2024-10-01
Publication Type: Journal Article
JCR: Q1 (Computer Science, Cybernetics); CAS Region: 2 (Computer Science)
Platform: Semanticscholar
URL: https://ieeexplore.ieee.org/document/10701611/
Citations: 0
Abstract
Automatically generating video background music mitigates the inefficiency and time-consuming drawbacks of current manual video editing. Two key challenges hinder progress on video-to-music tasks: 1) limited availability of high-quality video–music datasets and annotations, and 2) the absence of music generation methods that consider actual musicality, controlled by interpretable factors grounded in music theory. In this article, we propose video echoed in harmony (VEH), a method for learning and sampling video-integrated chord progression sequences. Our approach adopts harmony, represented by chord progressions aligned with various music formats [musical instrument digital interface (MIDI), audio, and score], imitating chord precedence in human music composition. Visual-language models link visual features to chord progressions through genre labels and descriptive words in generated textualized videos. Together, these two features obviate the need for extensive video–music paired data. In addition, an energy-based chord progression learning and sampling algorithm quantifies abstract harmony impressions into statistical features, which serve as interpretable factors for controllable music generation grounded in music theory. Experimental results demonstrate that the proposed method outperforms the state-of-the-art, producing superior music alignment for a given video.
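The abstract describes an energy-based learning and sampling algorithm over chord progressions, with statistical features acting as controllable, interpretable factors. The paper's actual energy terms, chord vocabulary, and feature statistics are not given here, so the following is only a minimal illustrative sketch: a Metropolis-style sampler over a toy chord set, whose assumed energy combines a chord-precedence term (preferred bigram transitions) with the squared deviation of one statistical feature (the fraction of tonic chords) from a user-set target.

```python
import math
import random

# Illustrative sketch only: chord vocabulary, preferred transitions, and the
# energy terms below are assumptions, not the paper's actual model.
CHORDS = ["C", "Dm", "Em", "F", "G", "Am"]

# Hypothetical bigram preferences imitating chord precedence in common practice
# (e.g., the G -> C cadence). Transitions outside this set incur an energy penalty.
PREFERRED = {("G", "C"), ("F", "G"), ("C", "F"), ("Am", "F"), ("Dm", "G"), ("C", "Am")}

def energy(seq, target_tonic_ratio=0.3):
    """Energy = count of non-preferred transitions + weighted squared deviation
    of a statistical feature (fraction of tonic 'C' chords) from its target."""
    transition = sum(0.0 if (a, b) in PREFERRED else 1.0 for a, b in zip(seq, seq[1:]))
    tonic_ratio = seq.count("C") / len(seq)
    return transition + 10.0 * (tonic_ratio - target_tonic_ratio) ** 2

def sample(length=8, steps=5000, temp=0.5, seed=0):
    """Metropolis sampling: propose a single-chord change and accept it with
    probability min(1, exp(-(E_new - E_old) / temp)); track the best sequence."""
    rng = random.Random(seed)
    seq = [rng.choice(CHORDS) for _ in range(length)]
    e = energy(seq)
    best_seq, best_e = list(seq), e
    for _ in range(steps):
        i = rng.randrange(length)
        old = seq[i]
        seq[i] = rng.choice(CHORDS)
        e_new = energy(seq)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temp):
            e = e_new
            if e < best_e:
                best_seq, best_e = list(seq), e
        else:
            seq[i] = old  # reject the proposal and restore the old chord
    return best_seq, best_e

if __name__ == "__main__":
    progression, final_energy = sample()
    print(progression, round(final_energy, 3))
```

Raising `target_tonic_ratio` illustrates the "controllable" aspect: the same sampler is steered toward progressions with more tonic chords purely by changing one interpretable statistical factor, without retraining anything.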
Journal Description:
IEEE Transactions on Computational Social Systems focuses on topics such as modeling, simulation, analysis, and understanding of social systems from a quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the transactions publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, computational behavior modeling, and their applications.