Speech/music separation using non-negative matrix factorization with combination of cost functions
Authors: B. Nasersharif, S. Abdali
Venue: 2015 The International Symposium on Artificial Intelligence and Signal Processing (AISP)
Publication date: 2015-03-03
DOI: 10.1109/AISP.2015.7123491 (https://doi.org/10.1109/AISP.2015.7123491)
Non-negative matrix factorization (NMF) is one approach to separating speech from music as a single-channel source separation problem. In this approach, the spectrogram of each source signal is factorized into the product of two matrices, known as the basis and weight matrices. To obtain a good estimate of the signal spectrogram, the weight and basis matrices are updated iteratively. A cost function is typically used to measure the distance between the signal and its estimate. Different cost functions have been introduced based on the Kullback-Leibler (KL) and Itakura-Saito (IS) divergences. The IS divergence is scale-invariant, which makes it suitable when the signal coefficients have a large dynamic range, as in the short-term spectra of music. Based on this property, we propose using the IS divergence as the NMF cost function in the training stage for music and the KL divergence as the NMF cost function in the training stage for speech. Moreover, in the decomposition stage, we propose using a linear combination of these two divergences together with a regularization term that incorporates temporal continuity as prior knowledge. Experimental results on one hour of speech and music show a good trade-off between the signal-to-interference ratio (SIR) of speech and music in comparison to conventional NMF methods.
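The following is a minimal NumPy sketch of the ingredients the abstract names, not the authors' implementation. It assumes the standard multiplicative updates for the beta-divergence family (`beta=1` gives KL, `beta=0` gives IS), and it assumes a simple squared-difference penalty on adjacent weight columns as the temporal continuity term. The function names, the mixing weight `lam`, and the regularization weight `mu` are hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def kl_divergence(V, V_hat):
    """Generalized Kullback-Leibler divergence D(V || V_hat), summed over all entries."""
    V = V + EPS
    V_hat = V_hat + EPS
    return float(np.sum(V * np.log(V / V_hat) - V + V_hat))


def is_divergence(V, V_hat):
    """Itakura-Saito divergence D(V || V_hat); it depends only on the ratio V / V_hat."""
    R = (V + EPS) / (V_hat + EPS)
    return float(np.sum(R - np.log(R) - 1.0))


def train_basis(V, n_components, beta, n_iter=200, seed=0):
    """Factorize a non-negative spectrogram V (freq x time) as W @ H.

    Uses the standard multiplicative updates for the beta-divergence:
    beta=1.0 corresponds to KL, beta=0.0 to IS.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + EPS   # basis matrix
    H = rng.random((n_components, n_frames)) + EPS  # weight matrix
    for _ in range(n_iter):
        V_hat = W @ H + EPS
        H *= (W.T @ (V * V_hat ** (beta - 2))) / (W.T @ V_hat ** (beta - 1) + EPS)
        V_hat = W @ H + EPS
        W *= ((V * V_hat ** (beta - 2)) @ H.T) / (V_hat ** (beta - 1) @ H.T + EPS)
    return W, H


def combined_cost(V, W, H, lam=0.5, mu=0.1):
    """Illustrative decomposition-stage cost (assumed form, not taken from the paper):

    lam * KL + (1 - lam) * IS + mu * temporal continuity penalty,
    where the penalty sums squared differences between adjacent columns of H.
    """
    V_hat = W @ H
    continuity = float(np.sum(np.diff(H, axis=1) ** 2))
    return (lam * kl_divergence(V, V_hat)
            + (1.0 - lam) * is_divergence(V, V_hat)
            + mu * continuity)
```

Under these assumptions, one would call `train_basis(V_speech, k, beta=1.0)` on speech training spectrograms and `train_basis(V_music, k, beta=0.0)` on music training spectrograms, concatenate the two basis matrices, and then evaluate `combined_cost` on the mixture while updating only the weights. Scaling both arguments of `is_divergence` by the same constant leaves its value unchanged, whereas `kl_divergence` scales linearly with that constant, which is the scale-invariance property the abstract relies on.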