{"title":"基于多模态卷积神经网络的两阶段音视频语音分离","authors":"Yang Xian, Yang Sun, Wenwu Wang, S. M. Naqvi","doi":"10.1109/SSPD.2019.8751656","DOIUrl":null,"url":null,"abstract":"The performance of the audio-only neural networks based monaural speech separation methods is still limited, particularly when multiple-speakers are active. The very recent method [1] used the audio-video (AV) model to find the non-linear relationship between the noisy mixture and the desired speech signal. However, the over-fitting problem always happens when the AV model is trained. Hence, the separation performance is limited. To address this limitation, we propose a system with two sequentially trained AV models to separate the desired speech signal. In the proposed system, after the first AV model is trained, its output is used to calculate the training target of the second AV model, which is exploited to further improve the separation performance. The GRID audiovisual sentence corpus is used to generate the training and testing datasets. The signal to distortion ratio (SDR) and short-time objective intelligibility (STOI) proved the proposed system outperforms the state-of-the-art method.","PeriodicalId":281127,"journal":{"name":"2019 Sensor Signal Processing for Defence Conference (SSPD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Two Stage Audio-Video Speech Separation using Multimodal Convolutional Neural Networks\",\"authors\":\"Yang Xian, Yang Sun, Wenwu Wang, S. M. Naqvi\",\"doi\":\"10.1109/SSPD.2019.8751656\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The performance of the audio-only neural networks based monaural speech separation methods is still limited, particularly when multiple-speakers are active. The very recent method [1] used the audio-video (AV) model to find the non-linear relationship between the noisy mixture and the desired speech signal. However, the over-fitting problem always happens when the AV model is trained. Hence, the separation performance is limited. To address this limitation, we propose a system with two sequentially trained AV models to separate the desired speech signal. In the proposed system, after the first AV model is trained, its output is used to calculate the training target of the second AV model, which is exploited to further improve the separation performance. The GRID audiovisual sentence corpus is used to generate the training and testing datasets. 
The signal to distortion ratio (SDR) and short-time objective intelligibility (STOI) proved the proposed system outperforms the state-of-the-art method.\",\"PeriodicalId\":281127,\"journal\":{\"name\":\"2019 Sensor Signal Processing for Defence Conference (SSPD)\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 Sensor Signal Processing for Defence Conference (SSPD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SSPD.2019.8751656\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Sensor Signal Processing for Defence Conference (SSPD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SSPD.2019.8751656","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The performance of audio-only neural network based monaural speech separation methods is still limited, particularly when multiple speakers are active. A very recent method [1] used an audio-video (AV) model to learn the non-linear relationship between the noisy mixture and the desired speech signal. However, over-fitting often occurs when the AV model is trained, and the separation performance is therefore limited. To address this limitation, we propose a system with two sequentially trained AV models to separate the desired speech signal. In the proposed system, after the first AV model is trained, its output is used to calculate the training target of the second AV model, which further improves the separation performance. The GRID audiovisual sentence corpus is used to generate the training and testing datasets. Results in terms of signal-to-distortion ratio (SDR) and short-time objective intelligibility (STOI) confirm that the proposed system outperforms the state-of-the-art method.
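The two-stage pipeline described in the abstract can be pictured with a short sketch. The code below is a minimal illustration, not the authors' implementation: the network architecture, feature dimensions, mask-based loss, and the rule for deriving the second model's training target (the `make_stage2_target` helper) are all assumptions made for illustration only.

```python
# Minimal sketch of two sequentially trained audio-video (AV) models.
# All shapes, the fusion network, and the stage-2 target rule are
# illustrative assumptions, not details taken from the paper.
import torch
import torch.nn as nn

class AVModel(nn.Module):
    """Toy stand-in for a multimodal CNN: fuses audio and video features
    and predicts a time-frequency mask for the desired speaker."""
    def __init__(self, audio_dim=257, video_dim=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 256), nn.ReLU(),
            nn.Linear(256, audio_dim), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, audio_feat, video_feat):
        return self.fuse(torch.cat([audio_feat, video_feat], dim=-1))

def train(model, loader, epochs=10, lr=1e-3):
    """Train one AV model: apply the predicted mask to the mixture
    magnitude and regress onto the given training target."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for mixture_mag, video_feat, target_mag in loader:
            opt.zero_grad()
            estimate = model(mixture_mag, video_feat) * mixture_mag
            loss = loss_fn(estimate, target_mag)
            loss.backward()
            opt.step()

def make_stage2_target(stage1_estimate, clean_mag):
    """Hypothetical rule for building the second model's training target
    from the first model's output; the paper's exact rule may differ."""
    return 0.5 * (stage1_estimate + clean_mag)

# Stage 1: train the first AV model against the clean-speech magnitude.
model1 = AVModel()
# train(model1, stage1_loader)  # loader yields (mixture, video, clean)

# Stage 2: freeze the first model, derive the second model's target from
# its output, then train the second model on that refined target.
model2 = AVModel()
# with torch.no_grad():
#     stage1_out = model1(mixture_mag, video_feat) * mixture_mag
# stage2_target = make_stage2_target(stage1_out, clean_mag)
# train(model2, stage2_loader)
```

For the reported metrics, off-the-shelf implementations exist: `mir_eval.separation.bss_eval_sources` computes SDR from reference and estimated sources, and the `pystoi` package computes STOI from the clean and separated waveforms.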