End-to-end Visual-guided Audio Source Separation with Enhanced Losses

D. Pham, Quang-Anh Do, Thanh Thi Hien Duong, Thi-Lan Le, Phi-Le Nguyen
DOI: 10.23919/APSIPAASC55919.2022.9980162
Published in: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Publication date: 2022-11-07
Citations: 2

Abstract

Visual-guided Audio Source Separation (VASS) refers to separating individual sound sources from an audio mixture of multiple simultaneous sound sources by using additional visual features that guide the separation process. For the VASS task, visual features and the correlation between audio and visual information play an important role; we exploit them to estimate better audio masks and improve separation performance. In this paper, we propose an approach that jointly trains the components of a cross-modal retrieval framework on video data, enabling the network to learn more effective features. This end-to-end framework is trained with three loss functions: 1) a separation loss that limits the discrepancy of the separated magnitude spectrograms, 2) an object-consistency loss that enforces consistency between the separated audio and the visual information, and 3) a cross-modal loss that maximizes the correlation between audio and its corresponding visual sounding object while also maximizing the difference between the audio and visual information of different objects. The proposed VASS model was evaluated on the benchmark MUSIC dataset, which contains a large number of videos of people playing instruments in different combinations. Experimental results confirmed the advantages of our model over previous VASS models.
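The three losses in the abstract can be sketched as follows. This is a minimal illustrative sketch only: the specific loss forms (L1 spectrogram distance, cross-entropy object classification, triplet-style margin loss) and all function names and weights are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def separation_loss(pred_mag, true_mag):
    """L1 discrepancy between predicted and ground-truth magnitude spectrograms."""
    return np.abs(pred_mag - true_mag).mean()

def object_consistency_loss(audio_logits, object_label):
    """Cross-entropy pushing the separated audio to be classified as the
    visual object it was conditioned on (assumed classifier head)."""
    z = audio_logits - audio_logits.max()          # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[object_label]

def cross_modal_loss(audio_emb, visual_emb, margin=0.5):
    """Triplet-style loss: raise the similarity of matching audio/visual
    pairs while separating mismatched pairs by at least `margin`."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    n, loss = len(audio_emb), 0.0
    for i in range(n):
        pos = cos(audio_emb[i], visual_emb[i])     # matching pair
        for j in range(n):
            if j == i:
                continue
            neg = cos(audio_emb[i], visual_emb[j]) # mismatched pair
            loss += max(0.0, margin - pos + neg)
    return loss / (n * (n - 1))

def total_loss(pred_mag, true_mag, audio_logits, label,
               audio_emb, visual_emb, w=(1.0, 0.1, 0.1)):
    """Weighted sum of the three losses (weights are hypothetical)."""
    return (w[0] * separation_loss(pred_mag, true_mag)
            + w[1] * object_consistency_loss(audio_logits, label)
            + w[2] * cross_modal_loss(audio_emb, visual_emb))
```

In an end-to-end setup such as the one described, all three terms would be backpropagated jointly through the audio and visual encoders, so the embeddings and the separation masks are optimized together rather than in separate stages.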