Jiang Huang, Xianglin Huang, Lifang Yang, Zhulin Tao
{"title":"由面部表情和舞蹈动作驱动的联合音乐生成 D2MNet","authors":"Jiang Huang, Xianglin Huang, Lifang Yang, Zhulin Tao","doi":"10.1016/j.array.2024.100348","DOIUrl":null,"url":null,"abstract":"<div><p>In general, dance is always associated with music to improve stage performance effect. As we know, artificial music arrangement consumes a lot of time and manpower. While automatic music arrangement based on input dance video perfectly solves this problem. In the cross-modal music generation task, we take advantage of the complementary information between two input modalities of facial expressions and dance movements. Then we present Dance2MusicNet (D2MNet), an autoregressive generation model based on dilated convolution, which adopts two feature vectors, dance style and beats, as control signals to generate real and diverse music that matches dance video. Finally, a comprehensive evaluation method for qualitative and quantitative experiment is proposed. Compared to baseline methods, D2MNet outperforms better in all evaluating metrics, which clearly demonstrates the effectiveness of our framework.</p></div>","PeriodicalId":8417,"journal":{"name":"Array","volume":"22 ","pages":"Article 100348"},"PeriodicalIF":2.3000,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2590005624000146/pdfft?md5=57bcf00a132e600642ca5c16a65b9121&pid=1-s2.0-S2590005624000146-main.pdf","citationCount":"0","resultStr":"{\"title\":\"D2MNet for music generation joint driven by facial expressions and dance movements\",\"authors\":\"Jiang Huang, Xianglin Huang, Lifang Yang, Zhulin Tao\",\"doi\":\"10.1016/j.array.2024.100348\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In general, dance is always associated with music to improve stage performance effect. As we know, artificial music arrangement consumes a lot of time and manpower. While automatic music arrangement based on input dance video perfectly solves this problem. In the cross-modal music generation task, we take advantage of the complementary information between two input modalities of facial expressions and dance movements. Then we present Dance2MusicNet (D2MNet), an autoregressive generation model based on dilated convolution, which adopts two feature vectors, dance style and beats, as control signals to generate real and diverse music that matches dance video. Finally, a comprehensive evaluation method for qualitative and quantitative experiment is proposed. Compared to baseline methods, D2MNet outperforms better in all evaluating metrics, which clearly demonstrates the effectiveness of our framework.</p></div>\",\"PeriodicalId\":8417,\"journal\":{\"name\":\"Array\",\"volume\":\"22 \",\"pages\":\"Article 100348\"},\"PeriodicalIF\":2.3000,\"publicationDate\":\"2024-05-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2590005624000146/pdfft?md5=57bcf00a132e600642ca5c16a65b9121&pid=1-s2.0-S2590005624000146-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Array\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2590005624000146\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Array","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590005624000146","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
D2MNet for music generation joint driven by facial expressions and dance movements
In general, dance is always associated with music to improve stage performance effect. As we know, artificial music arrangement consumes a lot of time and manpower. While automatic music arrangement based on input dance video perfectly solves this problem. In the cross-modal music generation task, we take advantage of the complementary information between two input modalities of facial expressions and dance movements. Then we present Dance2MusicNet (D2MNet), an autoregressive generation model based on dilated convolution, which adopts two feature vectors, dance style and beats, as control signals to generate real and diverse music that matches dance video. Finally, a comprehensive evaluation method for qualitative and quantitative experiment is proposed. Compared to baseline methods, D2MNet outperforms better in all evaluating metrics, which clearly demonstrates the effectiveness of our framework.