End-to-end Visual-guided Audio Source Separation with Enhanced Losses
D. Pham, Quang-Anh Do, Thanh Thi Hien Duong, Thi-Lan Le, Phi-Le Nguyen
2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022-11-07
DOI: 10.23919/APSIPAASC55919.2022.9980162
Abstract
Visual-guided Audio Source Separation (VASS) refers to separating individual sound sources from an audio mixture of multiple simultaneous sources by using additional visual features that guide the separation process. In the VASS task, visual features and the correlation between audio and visual information play an important role, and we exploit them to estimate better audio masks and improve separation performance. In this paper, we propose an approach that jointly trains the components of a cross-modal retrieval framework on video data, enabling the network to find more optimal features. This end-to-end framework is trained with three loss functions: 1) a separation loss that limits the discrepancy between the separated magnitude spectrograms and their targets, 2) an object-consistency loss that enforces the consistency of the separated audio with the visual information, and 3) a cross-modal loss that maximizes the correlation between an audio source and its corresponding sounding object while also maximizing the difference between the audio and visual information of different objects. The proposed VASS model was evaluated on the benchmark MUSIC dataset, which contains a large number of videos of people playing instruments in different combinations. Experimental results confirm the advantages of our model over previous VASS models.
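The abstract describes a joint objective built from three terms. The sketch below illustrates one plausible way such a combined loss could look in PyTorch; the specific formulations (L1 spectrogram distance, cross-entropy object consistency, triplet-style cross-modal contrast), the function names, and the loss weights are assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of the three training losses described in the abstract.
# All names, shapes, and loss choices are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def separation_loss(pred_mag, target_mag):
    # 1) Separation loss: penalize the discrepancy between separated and
    #    target magnitude spectrograms (L1 distance assumed here).
    return F.l1_loss(pred_mag, target_mag)


def object_consistency_loss(audio_logits, object_labels):
    # 2) Object-consistency loss: the separated audio, when classified, should
    #    agree with the visual object it was conditioned on (cross-entropy assumed).
    return F.cross_entropy(audio_logits, object_labels)


def cross_modal_loss(audio_emb, visual_emb, margin=0.5):
    # 3) Cross-modal loss: pull an audio source toward its own sounding object's
    #    embedding and push it away from other objects (triplet-style assumed).
    audio_emb = F.normalize(audio_emb, dim=-1)
    visual_emb = F.normalize(visual_emb, dim=-1)
    sim = audio_emb @ visual_emb.t()                      # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)                         # matching audio-visual pairs
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim[neg_mask].view(sim.size(0), -1)             # mismatched pairs
    return F.relu(margin - pos + neg).mean()


def total_loss(pred_mag, target_mag, audio_logits, object_labels,
               audio_emb, visual_emb, w_obj=1.0, w_cross=1.0):
    # Joint end-to-end objective: weighted sum of the three terms
    # (weights w_obj, w_cross are placeholders).
    return (separation_loss(pred_mag, target_mag)
            + w_obj * object_consistency_loss(audio_logits, object_labels)
            + w_cross * cross_modal_loss(audio_emb, visual_emb))
```

In such a setup, the single scalar returned by `total_loss` would be backpropagated through both the audio separation network and the visual feature extractor, which is what makes the training end-to-end.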