Fabio Vesperini, Paolo Vecchiotti, E. Principi, S. Squartini, F. Piazza
2016 International Joint Conference on Neural Networks (IJCNN), 24 July 2016. DOI: 10.1109/IJCNN.2016.7727633
Deep neural networks for Multi-Room Voice Activity Detection: Advancements and comparative evaluation
This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Notable advancements are observed with respect to a previous work. A comparative and extensive analysis is conducted among four different neural networks (NN): the Deep Belief Network (DBN), the Multi-Layer Perceptron (MLP), the Bidirectional Long Short-Term Memory recurrent neural network (BLSTM), and the Convolutional Neural Network (CNN). The latter has recently achieved considerable success in the computational audio processing field and has been successfully employed in our task. Two home-recorded datasets are used in order to approximate real-life scenarios. They contain audio files from several microphones arranged in various rooms, from which six features are extracted and used as input for the deep neural classifiers. The output stage has been redesigned compared to the authors' previous contribution, in order to take advantage of the networks' discriminative ability. Our study consists of a multi-stage analysis focusing on the selection of the features, the network size, and the input microphones. Results are evaluated in terms of the Speech Activity Detection (SAD) error rate. As a result, best SAD values of 5.8% and 2.6% are reached in the two considered datasets, respectively. In addition, significant robustness to microphone positioning is observed in the case of the CNN.
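The abstract reports results as a Speech Activity Detection (SAD) error rate. As a minimal sketch, assuming SAD is computed as the fraction of misclassified frames (missed speech plus false alarms over total frames), the metric could look like the following; the paper's exact definition may weight missed-speech and false-alarm frames differently:

```python
# Sketch of a frame-level Speech Activity Detection (SAD) error rate.
# Assumption: SAD = (missed speech frames + false-alarm frames) / total frames;
# the paper's actual scoring may differ (e.g. collar tolerances, term weights).

def sad_error_rate(reference, prediction):
    """reference/prediction: per-frame labels, 0 = non-speech, 1 = speech."""
    if len(reference) != len(prediction):
        raise ValueError("frame sequences must be the same length")
    # Count every frame where the detector disagrees with the ground truth:
    # (r=1, p=0) is a missed-speech frame, (r=0, p=1) is a false alarm.
    errors = sum(1 for r, p in zip(reference, prediction) if r != p)
    return errors / len(reference)

# Hypothetical 10-frame example: 2 misclassified frames -> 20% SAD error rate.
ref = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
hyp = [0, 1, 1, 1, 0, 1, 0, 0, 1, 1]
print(f"SAD error rate: {sad_error_rate(ref, hyp):.1%}")  # -> SAD error rate: 20.0%
```

Under this definition, the reported 5.8% and 2.6% figures would correspond to the detector disagreeing with the ground-truth speech/non-speech segmentation on roughly 1 in 17 and 1 in 38 frames, respectively.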