{"title":"Extracting Specific Voice from Mixed Audio Source","authors":"Kunihiko Sato","doi":"10.1109/AIVR46125.2019.00039","DOIUrl":null,"url":null,"abstract":"We propose auditory diminished reality by a deep neural network (DNN) extracting a single speech signal from a mixture of sounds containing other speakers and background noise. To realize the proposed DNN, we introduce a new dataset comprised of multi-speakers and environment noises. We conduct evaluations for measuring the source separation quality of the DNN. Additionally, we compare the separation quality of models learned with different amounts of training data. As a result, we found there is no significant difference in the separation quality between 10 and 30 minutes of the target speaker's speech length for training data.","PeriodicalId":274566,"journal":{"name":"2019 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR)","volume":"133 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AIVR46125.2019.00039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We propose auditory diminished reality realized by a deep neural network (DNN) that extracts a single speech signal from a mixture of sounds containing other speakers and background noise. To train the proposed DNN, we introduce a new dataset comprising multiple speakers and environmental noise. We conduct evaluations to measure the DNN's source separation quality and compare the separation quality of models trained with different amounts of training data. We found no significant difference in separation quality between models trained on 10 minutes and 30 minutes of the target speaker's speech.
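The abstract does not specify the network architecture used for extraction. As a rough, hedged illustration of the general idea of DNN-based single-speaker extraction, the sketch below shows a mask-based spectrogram separation model in PyTorch; all layer choices, sizes, and names (e.g. `TargetSpeakerExtractor`) are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of mask-based target-speaker extraction (illustrative only;
# not the architecture from the paper, which the abstract does not describe).
import torch
import torch.nn as nn


class TargetSpeakerExtractor(nn.Module):
    def __init__(self, n_freq_bins: int = 257, hidden: int = 256):
        super().__init__()
        # BLSTM over spectrogram frames, followed by a per-bin sigmoid mask.
        self.blstm = nn.LSTM(n_freq_bins, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq_bins),
                                  nn.Sigmoid())

    def forward(self, mix_mag: torch.Tensor) -> torch.Tensor:
        # mix_mag: (batch, frames, freq_bins) magnitude spectrogram of the mixture.
        h, _ = self.blstm(mix_mag)
        m = self.mask(h)          # estimated time-frequency mask for the target speaker
        return m * mix_mag        # masked (separated) magnitude spectrogram


# Usage sketch: train with an L1 loss between the masked output and the clean
# target-speaker spectrogram, then reconstruct audio with the mixture phase (ISTFT).
model = TargetSpeakerExtractor()
mixture = torch.randn(4, 100, 257).abs()   # dummy batch: 4 clips, 100 frames, 257 bins
estimate = model(mixture)                  # (4, 100, 257)
target = torch.randn(4, 100, 257).abs()    # placeholder clean-target spectrograms
loss = torch.nn.functional.l1_loss(estimate, target)
```

Under such a setup, varying the amount of target-speaker training speech (e.g. 10 vs. 30 minutes, as compared in the paper) would simply change how many mixture/target spectrogram pairs are available for the training loop.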