Zero-mean Convolutional Network with Data Augmentation for Sound Level Invariant Singing Voice Separation
Kin Wah Edward Lin, Masataka Goto
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 251-255, 2019-05-12
DOI: https://doi.org/10.1109/ICASSP.2019.8682958
We address the problem of separating singing voices from polyphonic music signals regardless of variation in the sound level of the mixture input. Using the standard separation quality assessment tool BSS Eval 4.0, we found that the separation quality of a singing voice separation (SVS) system based on a dilatable Convolutional Neural Network (CNN) decreases when the input sound level changes. Even though this SVS system is comparable to state-of-the-art SVS systems, it is vulnerable to sound level variance. We therefore investigate four methods of making the CNN-based SVS system invariant to different sound levels: two types of data augmentation, frame normalization, and zero-mean convolution. By testing all 15 combinations of the four methods, we found that every combination can improve sound level invariance, and we analyzed the best combinations. To the best of our knowledge, this is the first SVS work to systematically investigate sound level variance.
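The count of 15 follows from taking every non-empty subset of four methods (2^4 - 1 = 15). The sketch below enumerates those subsets and illustrates, under stated assumptions, why a zero-mean kernel can help with level invariance: if the input is in a log-magnitude domain, a gain change becomes an additive offset, and a kernel whose weights sum to zero cancels any constant offset. The abstract does not say which input representation the paper uses, nor does it name the two augmentation types, so the log-domain assumption, the labels `augmentation_a` and `augmentation_b`, and the helper names are all hypothetical and not the authors' implementation.

```python
from itertools import combinations

import numpy as np

# Labels paraphrase the abstract. "augmentation_a" and "augmentation_b" are
# placeholders: the abstract does not name the two augmentation types.
METHODS = [
    "augmentation_a",
    "augmentation_b",
    "frame_normalization",
    "zero_mean_convolution",
]


def nonempty_combinations(items):
    """Return every non-empty subset of items as a tuple."""
    return [c for r in range(1, len(items) + 1) for c in combinations(items, r)]


def zero_mean(kernel):
    """Subtract the mean so the kernel weights sum to zero."""
    return kernel - kernel.mean()


def conv_valid(x, kernel):
    """1-D convolution without padding, so no border effects appear."""
    return np.convolve(x, kernel, mode="valid")


rng = np.random.default_rng(0)
frame = rng.normal(size=64)             # stand-in for one log-magnitude frame (assumption)
kernel = zero_mean(rng.normal(size=5))  # hypothetical zero-mean kernel
offset = 6.0                            # a gain change is an additive offset in the log domain

out_ref = conv_valid(frame, kernel)
out_shifted = conv_valid(frame + offset, kernel)

print(len(nonempty_combinations(METHODS)))
print(np.allclose(out_ref, out_shifted))
```

The offset cancels because each output sample changes by `offset * kernel.sum()`, which is zero up to floating-point error. This only demonstrates the general property of zero-sum kernels; how the paper enforces the constraint during training is not stated in the abstract.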