Mask Optimisation for Neural Network Monaural Source Separation
R. Cant, C. Langensiepen, W. Metcalf
2017 UKSim-AMSS 19th International Conference on Computer Modelling & Simulation (UKSim), 2017-04-01
DOI: 10.1109/UKSim.2017.21 (https://doi.org/10.1109/UKSim.2017.21)
Citations: 0
Abstract
An ideal binary mask is a means of separating multiple sound sources within a single audio file. Previous work has shown that a deep neural network can be trained to approximate the ideal mask, but at a substantial computational cost. We present a method to assess the impact of reducing the size of the mask by averaging time and frequency bins, so that the computational cost can be significantly reduced. Our work uses the mask from the original separate musical channels as a ground truth and compares it against an ideal binary mask and an ideal "soft" or proportional mask. The ideal soft mask is then compared against masks produced at a range of averaging levels. We find that averaging could reduce the number of weights in the neural network by a factor of 16 (and thus significantly improve computation time), while still achieving plausible source separation results.
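As a rough illustration of the masks discussed in the abstract, the sketch below builds a binary and a proportional ("soft") mask from two magnitude spectrograms and then averages the soft mask over non-overlapping time-frequency blocks. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the two-source setup, the random stand-in spectrograms, and the 4 x 4 block size. That block size was chosen only because it shrinks the mask by a factor of 16; how the mask size relates to the number of network weights depends on the architecture, which the abstract does not describe.

```python
import numpy as np

def ideal_binary_mask(target_mag, other_mag):
    """1.0 where the target source dominates a time-frequency bin, else 0.0."""
    return (target_mag > other_mag).astype(np.float64)

def ideal_soft_mask(target_mag, other_mag, eps=1e-12):
    """Proportion of each bin's total magnitude that belongs to the target."""
    return target_mag / (target_mag + other_mag + eps)

def average_mask(mask, f_block, t_block):
    """Average non-overlapping f_block x t_block tiles of a (freq, time) mask."""
    n_f, n_t = mask.shape
    if n_f % f_block or n_t % t_block:
        raise ValueError("mask shape must be divisible by the block sizes")
    tiles = mask.reshape(n_f // f_block, f_block, n_t // t_block, t_block)
    return tiles.mean(axis=(1, 3))

def expand_mask(small, f_block, t_block):
    """Repeat each averaged value so the mask regains its original shape."""
    return np.repeat(np.repeat(small, f_block, axis=0), t_block, axis=1)

# Hypothetical stand-ins for the magnitude spectrograms of two sources.
rng = np.random.default_rng(0)
target = rng.random((64, 32))   # (frequency bins, time frames)
other = rng.random((64, 32))

binary = ideal_binary_mask(target, other)
soft = ideal_soft_mask(target, other)

reduced = average_mask(soft, 4, 4)       # 4 x 4 averaging (assumed block size)
restored = expand_mask(reduced, 4, 4)    # back to full resolution for comparison
reduction = soft.size // reduced.size    # how many times fewer mask values

# One possible way to quantify what the averaging loses.
mean_abs_error = np.abs(restored - soft).mean()
```

Comparing `restored` with `soft` bin by bin, as in the last line, is only one way to measure the cost of averaging; the abstract does not state which comparison measure the authors used.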