{"title":"Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising","authors":"Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu","doi":"10.1109/SLT48900.2021.9383519","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383519","url":null,"abstract":"This paper proposes a new joint optimization framework for simultaneous dereverberation, acoustic echo cancellation, and denoising, which is motivated by the recently proposed con-volutional beamformer for simultaneous denoising and dereverberation. Using the echo aware mask based beamforming framework, the proposed algorithm could effectively deal with double-talk case and local inference, etc. The evaluations based on ERLE for echo only, and PESQ for double-talk demonstrate that the proposed algorithm could significantly improve the performance.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115142796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightweight Voice Anonymization Based on Data-Driven Optimization of Cascaded Voice Modification Modules","authors":"Hiroto Kai, Shinnosuke Takamichi, Sayaka Shiota, H. Kiya","doi":"10.1109/SLT48900.2021.9383535","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383535","url":null,"abstract":"In this paper, we propose a voice anonymization framework based on data-driven optimization of cascaded voice modification modules. With increasing opportunities to use speech dialogue with machines nowadays, research regarding privacy protection of speaker information encapsulated in speech data is attracting attention. Anonymization, which is one of the methods for privacy protection, is based on signal processing manners, and the other one based on machine learning ones. Both approaches have a trade off between intelligibility of speech and degree of anonymization. The proposed voice anonymization framework utilizes advantages of machine learning and signal processing-based approaches to find the optimized trade off between the two. We use signal processing methods with training data for optimizing hyperparameters in a data-driven manner. The speech is modified using cascaded lightweight signal processing methods and then evaluated using black-box ASR and ASV, respectively. Our proposed method succeeded in deteriorating the speaker recognition rate by approximately 22% while simultaneously improved the speech recognition rate by over 3% compared to a signal processing-based conventional method.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115611591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-To-End Lip Synchronisation Based on Pattern Classification","authors":"You Jin Kim, Hee-Soo Heo, Soo-Whan Chung, Bong-Jin Lee","doi":"10.1109/SLT48900.2021.9383616","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383616","url":null,"abstract":"The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125841489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Attention Fusion for Target Speaker Extraction","authors":"Hiroshi Sato, Tsubasa Ochiai, K. Kinoshita, Marc Delcroix, T. Nakatani, S. Araki","doi":"10.1109/SLT48900.2021.9383539","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383539","url":null,"abstract":"Target speaker extraction, which aims at extracting a target speaker’s voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than single modality methods for simulated data, its adaptation towards realistic situations has not been fully explored as well as evaluations on real recorded mixtures. One of the major issues to handle realistic situations is how to make the system robust to clue corruption because in real recordings both clues may not be equally reliable, e.g. visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion and its training methods that enable to effectively capture the reliability of the clues and weight the more reliable ones. Our proposals improve signal to distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals successfully work on real data.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133777402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient corpus design for wake-word detection","authors":"Delowar Hossain, Yoshinao Sato","doi":"10.1109/SLT48900.2021.9383569","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383569","url":null,"abstract":"Wake-word detection is an indispensable technology for preventing virtual voice agents from being unintentionally triggered. Although various neural networks were proposed for wake-word detection, less attention has been paid to efficient corpus design, which we address in this study. For this purpose, we collected speech data via a crowdsourcing platform and evaluated the performance of several neural networks when different subsets of the corpus were used for training. The results reveal the following requirements for efficient corpus design to produce a lower misdetection rate: (1) short segments of continuous speech can be used as negative samples, but they are not as effective as random words; (2) utterances of \"adversarial\" words, i.e., phonetically similar words to a wake-word, contribute to improving performance significantly when they are used as negative samples; (3) it is preferable for individual speakers to provide both positive and negative samples; (4) increasing the number of speakers is better than increasing the number of repetitions of a wake-word by each speaker.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122214371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker-Independent Visual Speech Recognition with the Inception V3 Model","authors":"Timothy Israel Santos, Andrew Abel, N. Wilson, Yan Xu","doi":"10.1109/SLT48900.2021.9383540","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383540","url":null,"abstract":"The natural process of understanding speech involves combining auditory and visual cues. CNN based lip reading systems have become very popular in recent years. However, many of these systems consider lipreading to be a black box problem, with limited detailed performance analysis. In this paper, we performed transfer learning by training the Inception v3 CNN model, which has pre-trained weights produced from IMAGENET, with the GRID corpus, delivering good speech recognition results, with 0.61 precision, 0.53 recall, and 0.51 F1-score. The lip reading model was able to automatically learn pertinent features, demonstrated using visualisation, and achieve speaker-independent results comparable to human lip readers on the GRID corpus. We also identify limitations that match those of humans, therefore limiting potential deep learning performance in real world situations.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115077393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}