Interspeech最新文献_第4页

Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages 自下而上发现不同语言中响应令牌（“后台通道”）的结构和变化

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11288

Andreas Liesenfeld, Mark Dingemanse

引用次数: 3

FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion FlowCPCVC：一种用于任意语音转换的对比预测编码监督流框架

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-577

Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang

{"title":"FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion","authors":"Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang","doi":"10.21437/interspeech.2022-577","DOIUrl":"https://doi.org/10.21437/interspeech.2022-577","url":null,"abstract":"Recently, the research of any-to-any voice conversion(VC) has been developed rapidly. However, they often suffer from unsat-isfactory quality and require two stages for training, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which results in higher speech naturalness and timbre similarity. FlowCPCVC is the first one-stage training system for any-to-any task in our knowledge by taking advantage of VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding(CPC) based content extractor to guide the flow module to discard the timbre and keeping the linguistic information. Our method directly incorporates the vocoder into the training, thus avoiding the loss of spectral information as in two-stage training. With a fancy method in training any-to-any task, we can also get robust results when using it in any-to-many conversion. Experiments show that FlowCPCVC achieves obvious improvement when compared to VQMIVC which is current state-of-the-art any-to-any voice conversion system. Our demo is available online 1 .","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2558-2562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48445659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Reducing Domain mismatch in Self-supervised speech pre-training 减少自监督语音预训练中的域不匹配

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-736

M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano

{"title":"Reducing Domain mismatch in Self-supervised speech pre-training","authors":"M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano","doi":"10.21437/interspeech.2022-736","DOIUrl":"https://doi.org/10.21437/interspeech.2022-736","url":null,"abstract":"Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on speciﬁc samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples by performing a ﬁne-grained data selection. ATM performs masking over the highly conﬁdent input frames as chosen by the scorer. This allows the model to learn meaningful representations. We conduct ﬁne-tuning experiments on two well-benchmarked cor-pora: LibriSpeech (matching the pre-training data) and, AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efﬁcacy of ATM on signiﬁcantly improving the recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3028-3032"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48701115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method 语义差分法对穿透性语音的感知评价

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-100

T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano

引用次数: 0

CALM: Constrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis 用于表达文本到语音合成的约束跨模态说话风格建模

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11275

Yi Meng, Xiang Li, Zhiyong Wu, Tingtian Li, Zixun Sun, Xinyu Xiao, Chi Sun, Hui Zhan, H. Meng

引用次数: 4

Spatial-aware Speaker Diarization for Multi-channel Multi-party Meeting 基于空间感知的多渠道多方会议发言人日记

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/Interspeech.2022-11412

Jie Wang, Yuji Liu, Binling Wang, Yiming Zhi, Song Li, Shipeng Xia, Jiayang Zhang, Feng Tong, Lin Li, Q. Hong

引用次数: 6

Online Learning of Open-set Speaker Identification by Active User-registration 基于主动用户注册的开放集说话人识别在线学习

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-25

Eunkyung Yoo, H. Song, Taehyeong Kim, Chul Lee

引用次数: 1

MSDWild: Multi-modal Speaker Diarization Dataset in the Wild MSDWild:狂野中的多模态说话人日记数据集

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10466

Tao Liu, Shuai Fan, Xu Xiang, Hongbo Song, Shaoxiong Lin, Jiaqi Sun, Tianyuan Han, Siyuan Chen, Binwei Yao, Sen Liu, Yifei Wu, Y. Qian, Kai Yu

引用次数: 6

Zero-Shot Foreign Accent Conversion without a Native Reference 没有本机引用的零样本外来重音转换

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10664

Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna

{"title":"Zero-Shot Foreign Accent Conversion without a Native Reference","authors":"Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna","doi":"10.21437/interspeech.2022-10664","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10664","url":null,"abstract":"Previous approaches for foreign accent conversion (FAC) ei-ther need a reference utterance from a native speaker (L1) during synthesis, or are dedicated one-to-one systems that must be trained separately for each non-native (L2) speaker. To address both issues, we propose a new FAC system that can transform L2 speech directly from previously unseen speakers. The system consists of two independent modules: a translator and a synthesizer, which operate on bottleneck features derived from phonetic posteriorgrams. The translator is trained to map bottleneck features in L2 utterances into those from a parallel L1 utterance. The synthesizer is a many-to-many system that maps input bottleneck features into the corresponding Mel-spectrograms, conditioned on an embedding from the L2 speaker. During inference, both modules operate in sequence to take an unseen L2 utterance and generate a native-accented Mel-spectrogram. Perceptual experiments show that our system achieves a large reduction (67%) in non-native accentedness compared to a state-of-the-art reference-free system (28.9%) that builds a dedicated model for each L2 speaker. Moreover, 80% of the listeners rated the synthesized utterances to have the same voice identity as the L2 speaker.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4920-4924"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43009987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Incremental learning for RNN-Transducer based speech recognition models 基于RNN传感器的语音识别模型的增量学习

Interspeech Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10795

Deepak Baby, Pasquale D’Alterio, Valentin Mendelev

{"title":"Incremental learning for RNN-Transducer based speech recognition models","authors":"Deepak Baby, Pasquale D’Alterio, Valentin Mendelev","doi":"10.21437/interspeech.2022-10795","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10795","url":null,"abstract":"This paper investigates an incremental learning framework for a real-world voice assistant employing RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with changing distribution of customer requests. We demonstrate that a simple ﬁne-tuning approach with a combination of old and new training data can be used to incrementally update the model spending only several hours of training time and without any degradation on old data. This paper explores multiple rounds of incremental updates on the ASR model with monthly training data. Results show that the proposed approach achieves 5-6% relative WER improvement over the models trained from scratch on the monthly evaluation datasets. In addition, we explore if it is pos-sible to improve recognition of speciﬁc new words. We simulate multiple rounds of incremental updates with handful of training utterances per word (both real and synthetic) and show that the recognition of the new words improves dramatically but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting speciﬁc words.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"71-75"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47633462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5