2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP): Latest Publications

Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362084
Sean Shensheng Xu, M. Mak, Ka Ho WONG, H. Meng, T. Kwok
Abstract: This paper investigates an age-invariant speaker embedding approach to speaker diarization, an essential step toward automatic cognitive assessment from speech. Studies have shown that incorporating speaker traits (e.g., age and gender) can improve speaker diarization performance. However, we found that age information in the speaker embeddings is detrimental to speaker diarization when there is a severe mismatch between the age distributions of the training and test data. To minimize this detrimental effect, an adversarial training strategy is introduced to remove age variability from the utterance-level speaker embeddings. Evaluations on an interactive dialog dataset for Montreal Cognitive Assessments (MoCA) show that the adversarial training strategy can produce age-invariant embeddings and reduce the diarization error rate (DER) by 4.33%. The approach also outperforms the conventional method even with less training data.
Citations: 3
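The age-removal idea described here is a standard adversarial setup; below is a minimal PyTorch sketch built around a gradient reversal layer, with illustrative layer sizes and heads that are not taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AgeInvariantEmbedder(nn.Module):
    # All dimensions here are illustrative assumptions.
    def __init__(self, feat_dim=512, emb_dim=256, n_speakers=1000, n_ages=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
        self.spk_head = nn.Linear(emb_dim, n_speakers)  # speaker classifier
        self.age_head = nn.Linear(emb_dim, n_ages)      # adversarial age branch

    def forward(self, x, lambd=1.0):
        emb = self.encoder(x)
        # Gradient reversal trains the encoder to *remove* age information
        # while the age head still tries to predict it.
        return emb, self.spk_head(emb), self.age_head(GradReverse.apply(emb, lambd))
```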
Non-autoregressive Deliberation-Attention based End-to-End ASR
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362115
Changfeng Gao, Gaofeng Cheng, Jun Zhou, Pengyuan Zhang, Yonghong Yan
Abstract: Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation in decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder, and thus significantly accelerates GPU parallel decoding. The D-Att decoder differs from the conventional attention decoder in two aspects: first, the D-Att decoder uses frame-level text embeddings (FLTE) generated by an auxiliary ASR model instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, the conventional attention decoder is trained in a left-to-right, label-synchronous way, whereas the D-Att decoder is trained under the supervision of the connectionist temporal classification (CTC) loss and uses the FLTE to provide text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models are comparable in performance to state-of-the-art autoregressive attention-based transformer E2E ASR baselines, and are 10 times faster with GPU parallel decoding.
Citations: 4
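To make the parallel frame-level decoding concrete, here is a minimal PyTorch sketch of a D-Att-style decoder block under assumed dimensions (not the paper's exact architecture): the FLTE queries replace autoregressive token inputs, and the per-frame outputs are supervised with CTC.

```python
import torch
import torch.nn as nn

class DAttDecoder(nn.Module):
    """Frame-level non-autoregressive decoder sketch: queries come from the
    frame-level text embedding (FLTE) of an auxiliary ASR model rather than
    from previous tokens, so all frames are decoded in parallel."""
    def __init__(self, d_model=256, n_heads=4, vocab_size=4233):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, flte, enc_out):
        # flte: (B, T, d_model) FLTE queries; enc_out: (B, T', d_model) acoustics
        attn_out, _ = self.cross_attn(flte, enc_out, enc_out)
        h = self.ffn(attn_out + flte)
        return self.out(h)  # per-frame logits

# CTC supervision over the parallel per-frame outputs (shapes illustrative):
# logits = decoder(flte, enc_out)                        # (B, T, V)
# loss = nn.CTCLoss()(logits.log_softmax(-1).transpose(0, 1),
#                     targets, frame_lens, target_lens)
```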
Syllable-Based Acoustic Modeling With Lattice-Free MMI for Mandarin Speech Recognition
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362050
Jie Li, Zhiyun Fan, Xiaorui Wang, Yan Li
Abstract: Most automatic speech recognition (ASR) systems of the past decades have used context-dependent (CD) phones as the fundamental acoustic units. However, these phone-based approaches lack an easy and efficient way to model long-term temporal dependencies. Compared with phone units, syllables span a longer time, typically several phones, and therefore have more stable acoustic realizations. In this work, we aim to train a syllable-based acoustic model for Mandarin ASR with the lattice-free maximum mutual information (LF-MMI) criterion. We expect that the combination of longer linguistic units, an RNN-based model structure, and a sequence-level objective function can result in better modeling of long-term temporal acoustic variations. We make multiple modifications to improve the performance of the syllable-based AM and benchmark our models on two large-scale databases. Experimental results show that the proposed syllable-based AM performs much better than the CD phone-based baseline, especially on noisy test sets, with faster decoding speed.
Citations: 1
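As a concrete illustration of tonal-syllable units (outside the paper's LF-MMI training recipe), the third-party pypinyin package can map Mandarin characters to such units; the snippet below only shows what syllable units look like compared with characters.

```python
# pip install pypinyin
from pypinyin import lazy_pinyin, Style

text = "普通话语音识别"
syllables = lazy_pinyin(text, style=Style.TONE3)  # tone carried as a digit
print(syllables)  # ['pu3', 'tong1', 'hua4', 'yu3', 'yin1', 'shi2', 'bie2']

# Each tonal syllable becomes one acoustic modeling unit, spanning several
# phones and giving a more stable acoustic realization than a single phone.
unit_inventory = sorted(set(syllables))
```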
Spoken Language Understanding with Sememe Knowledge as Domain Knowledge
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362087
Sixia Li, J. Dang, Longbiao Wang
Abstract: Spoken language understanding (SLU) is a key procedure in task-oriented dialogue systems, and its performance has improved considerably thanks to deep neural networks with pre-trained textual features. However, data sparsity and ASR errors still hurt model performance. Previous studies showed that pre-defined rules and domain knowledge such as lexicon features can help with these issues, but those methods are not flexible. In this study, we propose a new form of domain knowledge, ontology-based sememe knowledge, and apply it to the SLU task via a weighted-sum network. To do so, we construct a sememe knowledge base by identifying slot meanings and extracting the corresponding sememes from HowNet. We extract sememe sets for the characters in a given utterance and use them as domain knowledge in the SLU task by means of the weighted-sum network. Because the weighted combinations of sememe sets can extend word meanings, the proposed method helps the model flexibly match a sparse word to a specific slot. Evaluation on a Mandarin corpus showed that the proposed approach achieved better performance than a leading method and was also robust to ASR errors.
Citations: 0
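A minimal sketch of the weighted-sum idea, with illustrative dimensions and without the HowNet extraction step: attention weights over each character's sememe set produce a knowledge vector that extends the character representation.

```python
import torch
import torch.nn as nn

class SememeWeightedSum(nn.Module):
    """Sketch: fold sememe knowledge into character features via a learned
    weighted sum over each character's sememe set. The sememe-ID lookup and
    all sizes are assumptions for illustration."""
    def __init__(self, d_model=256, n_sememes=2000):
        super().__init__()
        self.sememe_emb = nn.Embedding(n_sememes, d_model, padding_idx=0)
        self.score = nn.Linear(d_model, 1)

    def forward(self, char_feat, sememe_ids):
        # char_feat: (B, T, d) character features; sememe_ids: (B, T, S)
        sem = self.sememe_emb(sememe_ids)                    # (B, T, S, d)
        w = torch.softmax(self.score(sem).squeeze(-1), -1)   # (B, T, S)
        # A real implementation would mask padded sememe slots before softmax.
        sem_sum = (w.unsqueeze(-1) * sem).sum(dim=2)         # (B, T, d)
        return char_feat + sem_sum  # knowledge-enhanced representation
```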
Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362099
Chunyu Qiang, J. Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang
Abstract: Current end-to-end code-switching text-to-speech (TTS) systems can already generate high-quality speech mixing two languages in the same utterance, given a single-speaker bilingual corpus. When the speakers of the bilingual corpora differ, however, the naturalness and consistency of code-switching TTS degrade. The cross-lingual embedding-layer structure we propose relates similar syllables across languages, thus improving the naturalness and consistency of the generated speech. End-to-end code-switching TTS also suffers from prosody instability when synthesizing paragraph text. The text enhancement method we propose enriches the input with prosodic information and sentence-level context information, thus improving the prosody stability for paragraph text. Experimental results demonstrate the effectiveness of the proposed methods in naturalness, consistency, and prosody stability. In addition to Mandarin and English, we also apply these methods to Shanghainese and Cantonese corpora, showing that they can be extended to other languages to build end-to-end code-switching TTS systems.
Citations: 0
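A hedged sketch of what a cross-lingual embedding-layer structure could look like: per-language symbol tables projected through a shared layer so that similar syllables across languages can become related. The sizes and the exact sharing scheme are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class CrossLingualEmbedding(nn.Module):
    def __init__(self, n_zh_syllables=1500, n_en_phones=70, d_shared=256):
        super().__init__()
        self.zh_emb = nn.Embedding(n_zh_syllables, d_shared)  # Mandarin table
        self.en_emb = nn.Embedding(n_en_phones, d_shared)     # English table
        # A shared projection ties the two languages into one space.
        self.shared_proj = nn.Linear(d_shared, d_shared)

    def forward(self, ids, lang):
        emb = self.zh_emb(ids) if lang == "zh" else self.en_emb(ids)
        return self.shared_proj(emb)
```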
Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2021-01-24 DOI: 10.1109/ISCSLP49672.2021.9362054
Chenglong Wang, Jiangyan Yi, J. Tao, Ye Bai, Zhengkun Tian
Abstract: Attention-based models have recently shown powerful representation-learning ability in speaker recognition. However, most attention-mechanism-based models focus primarily on pooling layers. In this work, we present an end-to-end speaker verification system that leverages time-frequency and channel features hierarchically. To further improve system performance, we employ the Large Margin Cosine Loss to optimize the model. We carry out experiments on the VoxCeleb1 dataset to evaluate the effectiveness of our methods. The results suggest that our best system outperforms the i-vector + PLDA and x-vector systems by 53.3% and 7.6%, respectively.
Citations: 5
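One plausible minimal reading of hierarchical time-frequency and channel attention, sketched in PyTorch with illustrative shapes; the paper's exact block design may differ.

```python
import torch
import torch.nn as nn

class TFChannelAttention(nn.Module):
    """Sketch: a time-frequency attention map is applied first, followed by
    squeeze-excitation style channel re-weighting."""
    def __init__(self, channels=64, reduction=8):
        super().__init__()
        self.tf_attn = nn.Conv2d(channels, 1, kernel_size=1)  # (B,1,T,F) map
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                # x: (B, C, T, F)
        x = x * torch.sigmoid(self.tf_attn(x))           # time-frequency attention
        w = self.se(x).unsqueeze(-1).unsqueeze(-1)       # channel weights (B,C,1,1)
        return x * w
```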
Unsupervised Cross-Lingual Speech Emotion Recognition Using Domain Adversarial Neural Network
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2020-12-21 DOI: 10.1109/ISCSLP49672.2021.9362058
Xiong Cai, Zhiyong Wu, Kuo Zhong, Bin Su, Dongyang Dai, H. Meng
Abstract: Using deep learning approaches, speech emotion recognition (SER) within a single domain has achieved many excellent results. However, cross-domain SER remains a challenging task due to the distribution shift between source and target domains. In this work, we propose a Domain Adversarial Neural Network (DANN) based approach to mitigate this distribution shift for cross-lingual SER. Specifically, we add a language classifier and a gradient reversal layer after the feature extractor to force the learned representation to be both language-independent and emotionally meaningful. Our method is unsupervised, i.e., labels on the target language are not required, which makes it easy to apply to other languages. Experimental results show the proposed method provides an average absolute improvement of 3.91% over the baseline system on the arousal and valence classification tasks. Furthermore, we find that batch normalization is beneficial to the performance gain of DANN, so we also explore the effect of different ways of combining data for batch normalization.
Citations: 9
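A compact sketch of one DANN-style training step for this setup; the gradient reversal is emulated here with two optimizers and a negated language loss, and all module sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_ext = nn.Sequential(nn.Linear(80, 256), nn.ReLU())  # shared feature extractor
emo_clf = nn.Linear(256, 4)    # emotion classes (e.g., arousal/valence bins)
lang_clf = nn.Linear(256, 2)   # source vs. target language

opt_main = torch.optim.Adam(list(feat_ext.parameters()) + list(emo_clf.parameters()))
opt_lang = torch.optim.Adam(lang_clf.parameters())
ce, lambd = nn.CrossEntropyLoss(), 0.5

def train_step(x_src, y_emo, x_tgt):
    # 1) Train the language classifier on frozen features.
    feats = torch.cat([feat_ext(x_src), feat_ext(x_tgt)]).detach()
    y_lang = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    opt_lang.zero_grad()
    ce(lang_clf(feats), y_lang).backward()
    opt_lang.step()

    # 2) Train features for emotion while *confusing* the language classifier;
    #    the negated language loss plays the role of the gradient reversal.
    opt_main.zero_grad()
    feats = torch.cat([feat_ext(x_src), feat_ext(x_tgt)])
    loss = ce(emo_clf(feats[: len(x_src)]), y_emo) - lambd * ce(lang_clf(feats), y_lang)
    loss.backward()
    opt_main.step()
```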
Context-aware RNNLM Rescoring for Conversational Speech Recognition
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2020-11-18 DOI: 10.1109/ISCSLP49672.2021.9362109
Kun Wei, Pengcheng Guo, Hang Lv, Zhen Tu, Lei Xie
Abstract: Conversational speech recognition is regarded as a challenging task due to its free-style speaking and long-term contextual dependencies. Prior work has explored modeling long-range context through RNNLM rescoring, with improved performance. To further exploit what persists across a conversation, such as topic or speaker turn, we extend the rescoring procedure in a new context-aware manner. For RNNLM training, we capture the contextual dependencies by concatenating adjacent sentences with various tag words, such as speaker or intention information. For lattice rescoring, the lattices of adjacent sentences are also connected with the first-pass decoded result by tag words. Besides, we adopt a selective concatenation strategy based on tf-idf, making the best use of contextual similarity to improve transcription performance. Results on four different conversation test sets show that our approach yields up to 13.1% and 6% relative character error rate (CER) reduction compared with first-pass decoding and common lattice rescoring, respectively. Index Terms: conversational speech recognition, recurrent neural network language model, lattice rescoring
Citations: 4
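A small sketch of tf-idf based selective concatenation using scikit-learn: the previous utterance's first-pass hypothesis is prepended only when it is lexically similar enough to the current one. The tag word and threshold here are assumptions, not values from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_context(prev_hyp: str, cur_hyp: str, tag: str = "<turn>",
                  threshold: float = 0.2) -> str:
    # Character n-grams work for unsegmented Chinese text.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
    tfidf = vec.fit_transform([prev_hyp, cur_hyp])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    # Concatenate adjacent sentences with a tag word only if similar enough.
    return f"{prev_hyp} {tag} {cur_hyp}" if sim >= threshold else cur_hyp

print(build_context("今天天气不错", "天气这么好我们出去玩吧"))
```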
Adversarial Training for Multi-domain Speaker Recognition
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2020-11-17 DOI: 10.1109/ISCSLP49672.2021.9362053
Qing Wang, Wei Rao, Pengcheng Guo, Lei Xie
Abstract: In real-life applications, the performance of speaker recognition systems degrades when there is a mismatch between training and evaluation data. Many domain adaptation methods have been used successfully to eliminate such domain mismatches in speaker recognition. However, both training and evaluation data can themselves be composed of several subsets, and these inner variances of each dataset can also be considered different domains. Differently distributed subsets within the source or target dataset cause multi-domain mismatches, which affect speaker recognition performance. In this study, we propose to use adversarial training for multi-domain speaker recognition to solve both the domain-mismatch and the dataset-variance problems. With the proposed method, we obtain speech representations that are both multi-domain-invariant and speaker-discriminative. Experimental results on the DAC13 dataset indicate that the proposed method is not only effective at solving the multi-domain mismatch problem but also outperforms the compared unsupervised domain adaptation methods.
Citations: 5
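The distinguishing point, treating subsets inside the source and target data as separate domains, suggests a K-way adversarial branch rather than a binary source/target one; a minimal sketch, with an illustrative gradient reversal layer and sizes:

```python
import torch
import torch.nn as nn

class GRL(torch.autograd.Function):
    """Identity forward; negated gradient backward."""
    @staticmethod
    def forward(ctx, x): return x.view_as(x)
    @staticmethod
    def backward(ctx, g): return -g

n_domains = 4   # e.g., two source subsets + two target subsets (assumed)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
spk_head = nn.Linear(256, 5000)       # speaker-discriminative branch
dom_head = nn.Linear(256, n_domains)  # multi-domain adversarial branch

def forward(x):
    emb = encoder(x)
    # The K-way domain head behind the GRL pushes the encoder toward
    # representations that are invariant across all subsets at once.
    return spk_head(emb), dom_head(GRL.apply(emb))
```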
Controllable Emotion Transfer For End-to-End Speech Synthesis
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP) Pub Date: 2020-11-17 DOI: 10.1109/ISCSLP49672.2021.9362069
Tao Li, Shan Yang, Liumeng Xue, Lei Xie
Abstract: An emotion embedding space learned from references is a straightforward approach for emotion transfer in encoder-decoder structured emotional text-to-speech (TTS) systems. However, the transferred emotion in the synthetic speech is not accurate or expressive enough, with emotion category confusion. Moreover, it is hard to select an appropriate reference to deliver the desired emotion strength. To solve these problems, we propose a novel approach based on Tacotron. First, we plug in two emotion classifiers, one after the reference encoder and one after the decoder output, to enhance the emotion-discriminative ability of the emotion embedding and the predicted mel-spectrum. Second, we adopt a style loss to measure the difference between the generated and reference mel-spectra. The emotion strength in the synthetic speech can be controlled by adjusting the value of the emotion embedding, since the emotion embedding can be viewed as a feature map of the mel-spectrum. Experiments on emotion transfer and strength control show that the synthetic speech of the proposed method is more accurate and expressive, with less emotion category confusion, and that the control of emotion strength is more salient to listeners.
Citations: 56
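Since the emotion embedding is described as acting like a feature map of the mel-spectrum, strength control at inference can be sketched as simple scaling; the synthesizer interface below (reference_encoder, infer) is hypothetical, not the paper's API.

```python
def synthesize_with_strength(tacotron, text, ref_mel, alpha: float = 1.0):
    # Extract the emotion embedding from a reference mel-spectrogram,
    # then scale it: alpha < 1 weakens the emotion, alpha > 1 strengthens it.
    emo_emb = tacotron.reference_encoder(ref_mel)
    return tacotron.infer(text, emotion_embedding=alpha * emo_emb)
```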