Accent and Speaker Disentanglement in Many-to-many Voice Conversion
Zhichao Wang, Wenshuo Ge, Xiong Wang, Shan Yang, Wendong Gan, Haitao Chen, Hai Li, Lei Xie, Xiulin Li
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362120

Abstract: This paper proposes a joint voice and accent conversion approach that can convert an arbitrary source speaker's voice to a target speaker with a non-native accent. The problem is challenging because each target speaker only has training data in their native accent, so accent and speaker information must be disentangled during conversion model training and re-combined at conversion time. Within our recognition-synthesis conversion framework, we solve this problem with two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck (BN) features for differently accented speakers; this wipes out factors other than the linguistic information in the BN features used for conversion model training. Second, we use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we attach an auxiliary speaker classifier to the encoder and train it with an adversarial loss to remove speaker information from the encoder output. Experiments show that our approach is superior to the baseline: the proposed tricks are effective in improving accentedness, while audio quality and speaker similarity are well maintained.
Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning
Cunhang Fan, B. Liu, J. Tao, Jiangyan Yi, Zhengqi Wen, Leichao Song
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362059

Abstract: Recurrent neural networks (RNNs) have brought significant improvements to speech enhancement in recent years. However, their model complexity and inference cost are much higher than those of deep feed-forward neural networks (DNNs), which limits their application to speech enhancement. This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning. The TDNN captures long-range temporal context through a modular and incremental design, while preserving a feed-forward structure so that its inference cost is comparable to a standard DNN. To make full use of the training data, we propose a full data learning method: besides the usual noisy-to-clean (input-to-target) pairs, we also train on clean-to-clean and noise-to-silence pairs, so that all of the training data is used to train the enhancement model. Experiments conducted on the TIMIT dataset show that the proposed method achieves better performance than the DNN and comparable or even better performance than the BLSTM, while drastically reducing inference time compared with the BLSTM.
{"title":"Improved End-to-End Dysarthric Speech Recognition via Meta-learning Based Model Re-initialization","authors":"Disong Wang, Jianwei Yu, Xixin Wu, Lifa Sun, Xunying Liu, H. Meng","doi":"10.1109/ISCSLP49672.2021.9362068","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362068","url":null,"abstract":"Dysarthric speech recognition is a challenging task as dysarthric data is limited and its acoustics deviate significantly from normal speech. Model-based speaker adaptation is a promising method by using the limited dysarthric speech to fine-tune a base model that has been pre-trained from large amounts of normal speech to obtain speaker-dependent models. However, statistic distribution mismatches between the normal and dysarthric speech data limit the adaptation performance of the base model. To address this problem, we propose to re-initialize the base model via meta-learning to obtain a better model initialization. Specifically, we focus on end-to-end models and extend the model-agnostic meta learning (MAML) and Reptile algorithms to meta update the base model by repeatedly simulating adaptation to different dysarthric speakers. As a result, the re-initialized model acquires dysarthric speech knowledge and learns how to perform fast adaptation to unseen dysarthric speakers with improved performance. Experimental results on UASpeech dataset show that the best model with proposed methods achieves 54.2% and 7.6% relative word error rate reduction compared with the base model without finetuning and the model directly fine-tuned from the base model, respectively, and it is comparable with the state-of-the-art hybrid DNN-HMM model.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130666104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
Murong Ma, Haiwei Wu, Xuyang Wang, Lin Yang, Junjie Wang, Ming Li
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362056

Abstract: In this paper, we propose a deep convolutional neural network based acoustic word embedding system for code-switching query-by-example spoken term detection. Unlike previous configurations, we combine audio data in two languages for training instead of using only a single language. We transform the acoustic features of keyword templates and of search-content segments obtained in a sliding manner into fixed-dimensional vectors and calculate the distances between them. An auxiliary variability-invariant loss is also applied to training data of the same word spoken by different speakers. This strategy prevents the extractor from encoding undesired speaker- or accent-related information into the acoustic word embeddings. Experimental results show that the proposed system produces promising search results in the code-switching test scenario, and that the variability-invariant loss further enhances search performance.
Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning
Zhiping Zeng, V. T. Pham, Haihua Xu, Yerbolat Khassanov, Chng Eng Siong, Chongjia Ni, B. Ma
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362086

Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data through its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained on the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language, and an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference than both the LSTM and Transformer architectures.
Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems
Tingzhi Mao, Yerbolat Khassanov, V. T. Pham, Haihua Xu, Hao Huang, Chng Eng Siong
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362062

Abstract: In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NEs) in hybrid ASR systems without compromising overall word error rate. Underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data and therefore cannot be modeled reliably. We begin with a graphemic lexicon, which removes the need for phonetic models in hybrid ASR; we study it under different settings and demonstrate its effectiveness in dealing with underrepresented NEs. Next, we study the impact of a neural language model (LM) with letter-based features designed to handle infrequent words. After that, we enrich the representations of underrepresented NEs in a pretrained neural LM by borrowing the embedding representations of well-represented words, which yields a significant improvement in underrepresented NE recognition. Finally, we boost the likelihood scores of utterances containing NEs in the word lattices rescored by neural LMs and gain further improvement. The combination of these approaches improves NE recognition by up to 42% relative.
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362104

Abstract: This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration-allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Unlike conventional SVS models, ByteSing employs Tacotron-like encoder-decoder structures as the acoustic models, in which CBHG modules and recurrent neural networks (RNNs) are explored as encoders and decoders respectively. An auxiliary phoneme duration prediction model is used to expand the input sequence, which improves model controllability, stability and tempo prediction accuracy. WaveRNN vocoders are adopted as neural vocoders to further improve the voice quality of the synthesized songs. Both objective and subjective experimental results show that the proposed SVS method produces natural, expressive and high-fidelity songs by improving pitch and spectrogram prediction accuracy, and that the models using the attention mechanism achieve the best performance.
Rnn-transducer With Language Bias For End-to-end Mandarin-English Code-switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, J. Tao, Ye Bai
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362075

Abstract: Recently, language identity information has been used to improve the performance of end-to-end code-switching (CS) speech recognition. However, previous work uses an additional language identification (LID) model as an auxiliary module, which increases computation cost. In this work, we propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate this problem. We use language identities to bias the model toward predicting the CS points, which encourages the model to learn the language identity information directly from the transcriptions, so no additional LID model is needed. We evaluate the approach on the Mandarin-English CS corpus SEAME. Compared with our RNN-T baseline, the RNN-T with language bias achieves 16.2% and 12.9% relative mixed error rate reduction on the two test sets, respectively.
Towards Fine-Grained Prosody Control for Voice Conversion
Zheng Lian, J. Tao, Zhengqi Wen, Bin Liu, Yibin Zheng, Rongxiu Zhong
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362110

Abstract: In a typical voice conversion system, previous works utilize various acoustic features of the source speech (such as pitch, the voiced/unvoiced flag and aperiodicity) to control the prosody of the converted speech. However, prosody is related to many factors, such as intonation, stress and rhythm, and it is challenging to describe prosody perfectly through hand-crafted acoustic features. To address these difficulties, we propose to use prosody embeddings, learned from the source speech in an unsupervised manner, to describe prosody. To verify the effectiveness of the proposed method, we conduct experiments on our Mandarin corpus. Experimental results show that the proposed method improves the speech quality and speaker similarity of the converted speech. Moreover, we observe that the method achieves promising results even under singing conditions.
{"title":"Sams-Net: A Sliced Attention-based Neural Network for Music Source Separation","authors":"Tingle Li, Jiawei Chen, Haowen Hou, Ming Li","doi":"10.1109/ISCSLP49672.2021.9362081","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362081","url":null,"abstract":"Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) based models with the input of spectrogram or waveforms are commonly used for deep learning based audio source separation. In this paper, we propose a Sliced Attention-based neural network (Sams-Net) in the spectrogram domain for the music source separation task. It enables spectral feature interactions with multi-head attention mechanism, achieves easier parallel computing and has a larger receptive field com-pared with LSTMs and CNNs respectively. Experimental results on the MUSDB18 dataset show that the proposed method, with fewer parameters, outperforms most of the state-of-the-art DNN-based methods.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125206701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}