Title: Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments
Authors: Sean Shensheng Xu, M. Mak, Ka Ho WONG, H. Meng, T. Kwok
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362084
Venue: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP); published 2021-01-24
Abstract: This paper investigates an age-invariant speaker embedding approach to speaker diarization, an essential step toward automatic cognitive assessment from speech. Studies have shown that incorporating speaker traits (e.g., age and gender) can improve speaker diarization performance. However, we found that age information in the speaker embeddings is detrimental to speaker diarization when there is a severe mismatch between the age distributions of the training and test data. To minimize this detrimental effect, an adversarial training strategy is introduced to remove age variability from the utterance-level speaker embeddings. Evaluations on an interactive dialogue dataset for Montreal Cognitive Assessments (MoCA) show that the adversarial training strategy produces age-invariant embeddings and reduces the diarization error rate (DER) by 4.33%. The approach also outperforms the conventional method even with less training data.
Title: Non-autoregressive Deliberation-Attention based End-to-End ASR
Authors: Changfeng Gao, Gaofeng Cheng, Jun Zhou, Pengyuan Zhang, Yonghong Yan
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362115
Venue: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP); published 2021-01-24
Abstract: Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder and thus significantly accelerates GPU-parallel decoding. The D-Att decoder differs from the conventional attention decoder in two respects: first, it uses frame-level text embeddings (FLTE) generated by an auxiliary ASR model instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, whereas the conventional attention decoder is trained in a left-to-right, label-synchronous way, the D-Att decoder is trained under the supervision of the connectionist temporal classification (CTC) loss and uses the FLTE to provide text information. Our experiments on the Aishell, HKUST and WSJ benchmarks show that the proposed D-Att E2E ASR models are comparable in performance to state-of-the-art autoregressive attention-based Transformer E2E ASR baselines, and are 10 times faster with GPU-parallel decoding.
{"title":"Syllable-Based Acoustic Modeling With Lattice-Free MMI for Mandarin Speech Recognition","authors":"Jie Li, Zhiyun Fan, Xiaorui Wang, Yan Li","doi":"10.1109/ISCSLP49672.2021.9362050","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362050","url":null,"abstract":"Most automatic speech recognition (ASR) systems in past decades have used context-dependent (CD) phones as the fundamental acoustic units. However, these phone-based approaches lack an easy and efficient way for modeling long-term temporal dependencies. Compared with phone units, syllables span a longer time, typically several phones, thereby having more stable acoustic realizations. In this work, we aim to train a syllable-based acoustic model for Mandarin ASR with lattice-free maximum mutual information (LF-MMI) criterion. We expect that, the combination of longer linguistic units, the RNN-based model structure and the sequence-level objective function, can result in better modeling of long-term temporal acoustic variations. We make multiple modifications to improve the performance of syllable-based AM and benchmark our models on two large-scale databases. Experimental results show that the proposed syllable-based AM performs much better than the CD phone-based baseline, especially on noisy test sets, with faster decoding speed.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121998018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoken Language Understanding with Sememe Knowledge as Domain Knowledge","authors":"Sixia Li, J. Dang, Longbiao Wang","doi":"10.1109/ISCSLP49672.2021.9362087","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362087","url":null,"abstract":"Spoken language understanding (SLU) is a key procedure in task-oriented dialogue systems, its performance has been improved a lot due to deep neural network with pre-trained textual features. However, data sparsity and ASR error usually influence the model performance. Previous studies showed that pre-defined rules and domain knowledge such as lexicon features seems to be helpful for solving these issues. However, those methods are not flexible. In this study, we propose a new domain knowledge, ontology based sememe knowledge, and apply it in SLU task via a weighted sum network. To do so, we construct a sememe knowledge base by identifying slots’ meanings and extracting the corresponding sememes from HowNet. We extract sememe sets for characters in given utterance and use them as domain knowledge in SLU task by means of the weighted sum network. Due to the weighted combinations of the sememe sets can extend words’ meanings, the proposed method can help the model to flexibly match a sparse word to a specific slot. Evaluation on a Mandarin corpus showed that the proposed approach achieved better performance comparing to a leading method, and it also showed the robustness to ASR error.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124374969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS
Authors: Chunyu Qiang, J. Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362099
Venue: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP); published 2021-01-24
Abstract: Current end-to-end code-switching text-to-speech (TTS) systems can already generate high-quality speech mixing two languages in the same utterance when trained on a single speaker's bilingual corpus. When the speakers of the bilingual corpora differ, however, the naturalness and consistency of code-switching TTS degrade. The cross-lingual embedding-layer structure we propose relates similar syllables across languages, thereby improving the naturalness and consistency of the generated speech. End-to-end code-switching TTS also suffers from prosody instability when synthesizing paragraph-level text. The text enhancement method we propose adds prosodic information and sentence-level context information to the input, thereby improving prosody stability on paragraph text. Experimental results demonstrate the effectiveness of the proposed methods in naturalness, consistency, and prosody stability. In addition to Mandarin and English, we also apply these methods to Shanghainese and Cantonese corpora, showing that the proposed methods can be extended to other languages to build end-to-end code-switching TTS systems.
Title: Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification
Authors: Chenglong Wang, Jiangyan Yi, J. Tao, Ye Bai, Zhengkun Tian
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362054
Venue: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP); published 2021-01-24
Abstract: Attention-based models have recently shown powerful representation-learning ability in speaker recognition. However, most attention-based models focus primarily on the pooling layers. In this work, we present an end-to-end speaker verification system that leverages time-frequency and channel features hierarchically. To further improve system performance, we employ the Large Margin Cosine Loss to optimize the model. We carry out experiments on the VoxCeleb1 dataset to evaluate the effectiveness of our methods. The results suggest that our best system outperforms the i-vector + PLDA and x-vector systems by 53.3% and 7.6%, respectively.
Title: Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network
Authors: Xiong Cai, Zhiyong Wu, Kuo Zhong, Bin Su, Dongyang Dai, H. Meng
DOI: https://doi.org/10.1109/ISCSLP49672.2021.9362058
Venue: 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP); published 2020-12-21
Abstract: Using deep learning approaches, speech emotion recognition (SER) on a single domain has achieved many excellent results. However, cross-domain SER remains a challenging task owing to the distribution shift between source and target domains. In this work, we propose a Domain Adversarial Neural Network (DANN) based approach to mitigate this distribution-shift problem for cross-lingual SER. Specifically, we add a language classifier and a gradient reversal layer after the feature extractor to force the learned representation to be both language-independent and emotionally meaningful. Our method is unsupervised, i.e., no labels on the target language are required, which makes it easier to apply the method to other languages. Experimental results show that the proposed method provides an average absolute improvement of 3.91% over the baseline system on the arousal and valence classification tasks. Furthermore, we find that batch normalization is beneficial to the performance gain of DANN, so we also explore the effect of different ways of combining data for batch normalization.
{"title":"Context-aware RNNLM Rescoring for Conversational Speech Recognition","authors":"Kun Wei, Pengcheng Guo, Hang Lv, Zhen Tu, Lei Xie","doi":"10.1109/ISCSLP49672.2021.9362109","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362109","url":null,"abstract":"Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored the modeling of long-range context through RNNLM rescoring with improved performance. To further take advantage of the persisted nature during a conversation, such as topics or speaker turn, we extend the rescoring procedure to a new context-aware manner. For RNNLM training, we capture the contextual dependencies by concatenating adjacent sentences with various tag words, such as speaker or intention information. For lattice rescoring, the lattice of adjacent sentences are also connected with the first-pass decoded result by tag words. Besides, we also adopt a selective concatenation strategy based on tf-idf, making the best use of contextual similarity to improve transcription performance. Results on four different conversation test sets show that our approach yields up to 13.1% and 6% relative char-error-rate (CER) reduction compared with 1st-pass decoding and common lattice-rescoring, respectively. Index Terms: conversational speech recognition, recurrent neural network language model, lattice-rescoring","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"12 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130925746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adversarial Training for Multi-domain Speaker Recognition","authors":"Qing Wang, Wei Rao, Pengcheng Guo, Lei Xie","doi":"10.1109/ISCSLP49672.2021.9362053","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362053","url":null,"abstract":"In real-life applications, the performance of speaker recognition systems always degrades when there is a mismatch between training and evaluation data. Many domain adaptation methods have been successfully used for eliminating the domain mismatches in speaker recognition. However, usually both training and evaluation data themselves can be composed of several subsets. These inner variances of each dataset can also be considered as different domains. Different distributed subsets in source or target domain dataset can also cause multi-domain mismatches, which are influential to speaker recognition performance. In this study, we propose to use adversarial training for multi-domain speaker recognition to solve the domain mismatch and the dataset variance problems. By adopting the proposed method, we are able to obtain both multi-domain-invariant and speaker-discriminative speech representations for speaker recognition. Experimental results on DAC13 dataset indicate that the proposed method is not only effective to solve the multi-domain mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125795829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controllable Emotion Transfer For End-to-End Speech Synthesis","authors":"Tao Li, Shan Yang, Liumeng Xue, Lei Xie","doi":"10.1109/ISCSLP49672.2021.9362069","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362069","url":null,"abstract":"Emotion embedding space learned from references is a straight-forward approach for emotion transfer in encoder-decoder structured emotional text to speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate and expressive enough with emotion category confusions. Moreover, it is hard to select an appropriate reference to deliver desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug two emotion classifiers – one after the reference encoder, one after the decoder output – to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt style loss to measure the difference between the generated and reference mel-spectrum. The emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding as the emotion embedding can be viewed as the feature map of the mel-spectrum. Experiments on emotion transfer and strength control have shown that the synthetic speech of the proposed method is more accurate and expressive with less emotion category confusions and the control of emotion strength is more salient to listeners.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121667924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}