2021 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Neural Mask based Multi-channel Convolutional Beamforming for Joint Dereverberation, Echo Cancellation and Denoising
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383519
Jianming Liu, Meng Yu, Yong Xu, Chao Weng, Shi-Xiong Zhang, Lianwu Chen, Dong Yu
Abstract: This paper proposes a new joint optimization framework for simultaneous dereverberation, acoustic echo cancellation, and denoising, motivated by the recently proposed convolutional beamformer for simultaneous denoising and dereverberation. Using an echo-aware, mask-based beamforming framework, the proposed algorithm can effectively handle double-talk and local interference. Evaluations based on ERLE for echo-only conditions and PESQ for double-talk demonstrate that the proposed algorithm significantly improves performance.
Citations: 3
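As a rough illustration of the building block such echo-aware convolutional beamformers extend, here is a minimal NumPy sketch of mask-based MVDR beamforming: a neural time-frequency mask (assumed to come from an upstream network, omitted here) weights the spatial covariance estimates of speech and noise. The dereverberation and echo-cancellation filters of the full convolutional beamformer are not included, and all names are illustrative.

```python
# Minimal sketch: mask-based MVDR beamforming. The neural mask estimator and
# the dereverberation/echo filters of the full convolutional beamformer are
# omitted; `speech_mask` is assumed to be produced by an upstream network.
import numpy as np

def mask_based_mvdr(stft, speech_mask, ref_mic=0):
    """stft: (F, T, M) multi-channel STFT; speech_mask: (F, T), values in [0, 1]."""
    f_bins, t_frames, n_mics = stft.shape
    enhanced = np.zeros((f_bins, t_frames), dtype=complex)
    for f in range(f_bins):
        Y = stft[f]                       # (T, M) frames at this frequency bin
        m = speech_mask[f][:, None]       # (T, 1) speech presence probability
        # Mask-weighted spatial covariance matrices, shape (M, M)
        phi_s = np.einsum('ta,tb->ab', m * Y, Y.conj()) / max(m.sum(), 1e-8)
        phi_n = np.einsum('ta,tb->ab', (1 - m) * Y, Y.conj()) / max((1 - m).sum(), 1e-8)
        phi_n += 1e-6 * np.eye(n_mics)    # diagonal loading for invertibility
        num = np.linalg.solve(phi_n, phi_s)           # Phi_n^{-1} Phi_s
        w = num[:, ref_mic] / (np.trace(num) + 1e-8)  # MVDR weights, ref channel
        enhanced[f] = Y @ w.conj()                    # beamformer output w^H y
    return enhanced

# Example with random data: 257 bins, 100 frames, 4 microphones
out = mask_based_mvdr(np.random.randn(257, 100, 4) + 1j * np.random.randn(257, 100, 4),
                      np.random.rand(257, 100))
```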
Lightweight Voice Anonymization Based on Data-Driven Optimization of Cascaded Voice Modification Modules
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383535
Hiroto Kai, Shinnosuke Takamichi, Sayaka Shiota, H. Kiya
Abstract: In this paper, we propose a voice anonymization framework based on data-driven optimization of cascaded voice modification modules. With the growing use of spoken dialogue with machines, privacy protection for the speaker information contained in speech data is attracting attention. Anonymization methods are based either on signal processing or on machine learning, and both approaches involve a trade-off between speech intelligibility and degree of anonymization. The proposed framework combines the advantages of the two approaches to optimize this trade-off: the speech is modified by cascaded lightweight signal processing modules whose hyperparameters are optimized in a data-driven manner on training data, with evaluation by black-box ASR and ASV. Compared to a conventional signal processing-based method, the proposed method reduced the speaker recognition rate by approximately 22% while simultaneously improving the speech recognition rate by over 3%.
Citations: 15
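A minimal sketch of the general pattern, not the paper's actual modules: two hypothetical lightweight modifications (pitch shift and a resampling-based frequency warp) are cascaded, and their hyperparameters are scored against stand-in black-box ASR/ASV metrics so they can be tuned on training data. All module choices, scorers, and parameter grids are illustrative.

```python
# Minimal sketch of a cascaded voice-modification pipeline with data-driven
# hyperparameter search. The paper's actual modules are not reproduced here;
# pitch shifting and a resampling-based frequency warp stand in as
# hypothetical lightweight modules, and the ASR/ASV scorers are stubs.
import itertools
import librosa
import numpy as np

def anonymize(y, sr, n_steps, warp):
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)     # module 1: pitch shift
    y = librosa.resample(y, orig_sr=int(sr * warp), target_sr=sr)  # module 2: frequency warp
    return y

def asr_score(y, sr):  # stub: replace with a black-box ASR accuracy
    return float(np.random.rand())

def asv_score(y, sr):  # stub: replace with a black-box ASV accuracy
    return float(np.random.rand())

def grid_search(train_set):
    """Pick cascade hyperparameters that keep ASR accuracy high and ASV accuracy low."""
    best, best_params = -np.inf, None
    for n_steps, warp in itertools.product([-4, -2, 2, 4], [0.9, 1.0, 1.1]):
        score = sum(asr_score(anonymize(y, sr, n_steps, warp), sr)
                    - asv_score(anonymize(y, sr, n_steps, warp), sr)
                    for y, sr in train_set)
        if score > best:
            best, best_params = score, (n_steps, warp)
    return best_params
```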
SLT 2021 Table of Contents
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/slt48900.2021.9383586
Citations: 0
End-To-End Lip Synchronisation Based on Pattern Classification
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383616
You Jin Kim, Hee-Soo Heo, Soo-Whan Chung, Bong-Jin Lee
Abstract: The goal of this work is to synchronise the audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning and then computed similarities between audio and video frames using a sliding-window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that directly predicts the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features; inferring the offset can then be treated as a pattern recognition problem in which the matrix is regarded as an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms previous work by a large margin on the LRS2 and LRS3 datasets.
Citations: 8
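A minimal PyTorch sketch of the central idea: frame-level audio and video embeddings (assumed to come from upstream encoders, omitted here) form a cosine-similarity matrix, which a small CNN classifies into one of a discrete set of offsets. Layer sizes and the number of offset classes are illustrative, not the paper's configuration.

```python
# Minimal sketch: treat the audio-video similarity matrix as an image and
# classify it into offset bins. Upstream audio/video encoders are assumed;
# num_offsets=31 (e.g., offsets of -15..+15 frames) is an illustrative choice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetClassifier(nn.Module):
    def __init__(self, num_offsets=31):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.fc = nn.Linear(32 * 8 * 8, num_offsets)  # one class per offset bin

    def forward(self, audio_feats, video_feats):
        # audio_feats, video_feats: (B, T, D) frame-level embeddings
        a = F.normalize(audio_feats, dim=-1)
        v = F.normalize(video_feats, dim=-1)
        sim = torch.bmm(a, v.transpose(1, 2))          # (B, T, T) cosine similarities
        logits = self.fc(self.conv(sim.unsqueeze(1)).flatten(1))
        return logits                                  # argmax = predicted offset class

model = OffsetClassifier()
logits = model(torch.randn(2, 50, 512), torch.randn(2, 50, 512))
```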
SLT 2021 Cover Page
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/slt48900.2021.9383520
Citations: 0
Multimodal Attention Fusion for Target Speaker Extraction
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383539
Hiroshi Sato, Tsubasa Ochiai, K. Kinoshita, Marc Delcroix, T. Nakatani, S. Araki
Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual, or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech using complementary audio and visual clues. Although it offers more stable performance than single-modality methods on simulated data, neither its adaptation to realistic situations nor its evaluation on real recorded mixtures has been fully explored. A major issue in realistic situations is making the system robust to clue corruption, because in real recordings the clues may not be equally reliable; for example, visual clues may be affected by occlusion. In this work, we propose a novel attention mechanism for multi-modal fusion, together with training methods, that effectively captures the reliability of the clues and weights the more reliable ones. Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals works successfully on real data.
Citations: 18
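A minimal PyTorch sketch of attention-based modality fusion: a learned score per modality and time step is softmax-normalized into attention weights, so a corrupted clue (e.g., an occluded face) can be down-weighted in favour of the reliable one. Dimensions and the scoring network are illustrative, not the paper's exact design.

```python
# Minimal sketch: per-time-step attention over audio and visual clue
# embeddings. The clue encoders are assumed to exist upstream; the scoring
# layer and embedding size are illustrative.
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # reliability score per modality

    def forward(self, audio_clue, visual_clue):
        # audio_clue, visual_clue: (B, T, D) time-aligned clue embeddings
        clues = torch.stack([audio_clue, visual_clue], dim=2)  # (B, T, 2, D)
        attn = torch.softmax(self.score(clues), dim=2)         # (B, T, 2, 1) weights
        return (attn * clues).sum(dim=2)                       # (B, T, D) fused clue

fusion = ModalityAttentionFusion()
fused = fusion(torch.randn(4, 100, 256), torch.randn(4, 100, 256))
```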
SLT 2021 Organizing Committee
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/slt48900.2021.9383566
Citations: 0
Efficient corpus design for wake-word detection
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383569
Delowar Hossain, Yoshinao Sato
Abstract: Wake-word detection is an indispensable technology for preventing virtual voice agents from being unintentionally triggered. Although various neural networks have been proposed for wake-word detection, less attention has been paid to efficient corpus design, which we address in this study. For this purpose, we collected speech data via a crowdsourcing platform and evaluated the performance of several neural networks when different subsets of the corpus were used for training. The results reveal the following requirements for efficient corpus design to achieve a lower misdetection rate: (1) short segments of continuous speech can be used as negative samples, but they are not as effective as random words; (2) utterances of "adversarial" words, i.e., words phonetically similar to a wake-word, contribute significantly to improving performance when used as negative samples; (3) it is preferable for individual speakers to provide both positive and negative samples; (4) increasing the number of speakers is better than increasing the number of repetitions of the wake-word by each speaker.
Citations: 0
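A minimal sketch of how a training corpus assembled along these four findings might be composed; the speaker lists, sample sources, and counts are all illustrative, not taken from the paper.

```python
# Minimal sketch: compose a wake-word corpus following the paper's findings.
# Each *_takes argument maps a speaker id to a list of audio file paths.
import random

def build_corpus(speakers, wake_word_takes, adversarial_takes, random_word_takes):
    corpus = []  # list of (path, label) pairs; 1 = wake-word, 0 = negative
    for spk in speakers:
        # Finding (3): each speaker contributes both positives and negatives.
        # Finding (4): few repetitions per speaker; add speakers instead.
        corpus += [(p, 1) for p in wake_word_takes[spk][:3]]
        # Finding (2): phonetically similar "adversarial" words as negatives.
        corpus += [(p, 0) for p in adversarial_takes[spk]]
        # Finding (1): random words as negatives (more effective than
        # short segments of continuous speech).
        corpus += [(p, 0) for p in random_word_takes[spk]]
    random.shuffle(corpus)
    return corpus
```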
Speaker-Independent Visual Speech Recognition with the Inception V3 Model
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383540
Timothy Israel Santos, Andrew Abel, N. Wilson, Yan Xu
Abstract: The natural process of understanding speech involves combining auditory and visual cues. CNN-based lip reading systems have become very popular in recent years. However, many of these systems treat lipreading as a black-box problem, with limited detailed performance analysis. In this paper, we performed transfer learning by training the Inception V3 CNN model, initialized with weights pre-trained on ImageNet, on the GRID corpus, delivering good speech recognition results with 0.61 precision, 0.53 recall, and 0.51 F1-score. The lip reading model automatically learned pertinent features, as demonstrated using visualisation, and achieved speaker-independent results comparable to human lip readers on the GRID corpus. We also identify limitations that match those of humans, which constrain potential deep learning performance in real-world situations.
Citations: 4
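A minimal Keras sketch of this kind of transfer-learning setup: Inception V3 with ImageNet weights as the feature extractor, plus a new classification head trained on lip-region images. The class count, input size, and training details are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: transfer learning from ImageNet-pretrained Inception V3
# to a lip-image classifier. NUM_CLASSES and the head are illustrative.
import tensorflow as tf

NUM_CLASSES = 51  # assumption: e.g., the size of a GRID-style word vocabulary

base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False  # keep the pretrained ImageNet features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # new head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(lip_frames, word_labels, epochs=10)  # lip-region crops + labels
```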
SLT 2021 Title Page
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/slt48900.2021.9383601
Citations: 0