Accent and Speaker Disentanglement in Many-to-many Voice Conversion
Zhichao Wang, Wenshuo Ge, Xiong Wang, Shan Yang, Wendong Gan, Haitao Chen, Hai Li, Lei Xie, Xiulin Li
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362120

Abstract: This paper proposes a joint voice and accent conversion approach that can convert an arbitrary source speaker's voice to a target speaker with a non-native accent. The problem is challenging because each target speaker only has training data in their native accent, so accent and speaker information must be disentangled during conversion model training and re-combined at conversion time. Within our recognition-synthesis conversion framework, we solve this problem with two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck (BN) features for differently accented speakers; this wipes out factors other than the linguistic information in the BN features used for conversion model training. Second, we use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we attach an auxiliary speaker classifier to the encoder and train it with an adversarial loss to remove speaker information from the encoder output. Experiments show that our approach is superior to the baseline: the proposed tricks are effective in improving accentedness, while audio quality and speaker similarity are well maintained.
Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning
Cunhang Fan, B. Liu, J. Tao, Jiangyan Yi, Zhengqi Wen, Leichao Song
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362059

Abstract: Recurrent neural networks (RNNs) have brought significant improvements to speech enhancement in recent years. However, their model complexity and inference cost are much higher than those of deep feed-forward neural networks (DNNs), which limits their application to speech enhancement. This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning. The TDNN captures long-range temporal context through a modular and incremental design, while preserving a feed-forward structure so that its inference cost is comparable to a standard DNN. To make full use of the training data, we propose a full data learning method: besides the usual noisy-to-clean (input-to-target) pairs, we also train on clean-to-clean and noise-to-silence pairs, so that all of the training data is used to train the enhancement model. Experiments conducted on the TIMIT dataset show that the proposed method achieves better performance than the DNN and comparable or even better performance than the BLSTM, while drastically reducing inference time compared with the BLSTM.
{"title":"Improved End-to-End Dysarthric Speech Recognition via Meta-learning Based Model Re-initialization","authors":"Disong Wang, Jianwei Yu, Xixin Wu, Lifa Sun, Xunying Liu, H. Meng","doi":"10.1109/ISCSLP49672.2021.9362068","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362068","url":null,"abstract":"Dysarthric speech recognition is a challenging task as dysarthric data is limited and its acoustics deviate significantly from normal speech. Model-based speaker adaptation is a promising method by using the limited dysarthric speech to fine-tune a base model that has been pre-trained from large amounts of normal speech to obtain speaker-dependent models. However, statistic distribution mismatches between the normal and dysarthric speech data limit the adaptation performance of the base model. To address this problem, we propose to re-initialize the base model via meta-learning to obtain a better model initialization. Specifically, we focus on end-to-end models and extend the model-agnostic meta learning (MAML) and Reptile algorithms to meta update the base model by repeatedly simulating adaptation to different dysarthric speakers. As a result, the re-initialized model acquires dysarthric speech knowledge and learns how to perform fast adaptation to unseen dysarthric speakers with improved performance. Experimental results on UASpeech dataset show that the best model with proposed methods achieves 54.2% and 7.6% relative word error rate reduction compared with the base model without finetuning and the model directly fine-tuned from the base model, respectively, and it is comparable with the state-of-the-art hybrid DNN-HMM model.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130666104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
Murong Ma, Haiwei Wu, Xuyang Wang, Lin Yang, Junjie Wang, Ming Li
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362056

Abstract: In this paper, we propose a deep convolutional neural network based acoustic word embedding system for code-switching query-by-example spoken term detection. Unlike previous configurations, we combine audio data in two languages for training instead of using only a single language. We transform the acoustic features of keyword templates and of search-content segments obtained in a sliding manner into fixed-dimensional vectors and calculate the distances between them. An auxiliary variability-invariant loss is also applied to training data of the same word spoken by different speakers. This strategy prevents the extractor from encoding undesired speaker- or accent-related information into the acoustic word embeddings. Experimental results show that the proposed system produces promising search results in the code-switching test scenario, and that the variability-invariant loss further enhances search performance.
Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning
Zhiping Zeng, V. T. Pham, Haihua Xu, Yerbolat Khassanov, Chng Eng Siong, Chongjia Ni, B. Ma
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362086

Abstract: In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data through its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained on the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language, and an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference than both the LSTM and Transformer architectures.
Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems
Tingzhi Mao, Yerbolat Khassanov, V. T. Pham, Haihua Xu, Hao Huang, Chng Eng Siong
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362062

Abstract: In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NEs) in hybrid ASR systems without compromising overall word error rate. Underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data and therefore cannot be modeled reliably. We begin with a graphemic lexicon, which removes the need for phonetic models in hybrid ASR; we study it under different settings and demonstrate its effectiveness in dealing with underrepresented NEs. Next, we study the impact of a neural language model (LM) with letter-based features designed to handle infrequent words. After that, we enrich the representations of underrepresented NEs in a pretrained neural LM by borrowing the embedding representations of well-represented words, which yields a significant improvement in underrepresented NE recognition. Finally, we boost the likelihood scores of utterances containing NEs in the word lattices rescored by neural LMs and gain further improvement. The combination of these approaches improves NE recognition by up to 42% relative.
ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders
Yu Gu, Xiang Yin, Yonghui Rao, Yuan Wan, Benlai Tang, Yang Zhang, Jitong Chen, Yuxuan Wang, Zejun Ma
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362104

Abstract: This paper presents ByteSing, a Chinese singing voice synthesis (SVS) system based on duration-allocated Tacotron-like acoustic models and WaveRNN neural vocoders. Unlike conventional SVS models, ByteSing employs Tacotron-like encoder-decoder structures as the acoustic models, in which CBHG modules and recurrent neural networks (RNNs) are explored as encoders and decoders respectively. An auxiliary phoneme duration prediction model is used to expand the input sequence, which improves model controllability, stability and tempo prediction accuracy. WaveRNN vocoders are adopted as neural vocoders to further improve the voice quality of the synthesized songs. Both objective and subjective experimental results show that the proposed SVS method produces natural, expressive and high-fidelity songs by improving pitch and spectrogram prediction accuracy, and that the models using the attention mechanism achieve the best performance.
Rnn-transducer With Language Bias For End-to-end Mandarin-English Code-switching Speech Recognition
Shuai Zhang, Jiangyan Yi, Zhengkun Tian, J. Tao, Ye Bai
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362075

Abstract: Recently, language identity information has been used to improve the performance of end-to-end code-switching (CS) speech recognition. However, previous work uses an additional language identification (LID) model as an auxiliary module, which increases computation cost. In this work, we propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate this problem. We use language identities to bias the model toward predicting the CS points, which encourages the model to learn the language identity information directly from the transcriptions, so no additional LID model is needed. We evaluate the approach on the Mandarin-English CS corpus SEAME. Compared with our RNN-T baseline, the RNN-T with language bias achieves 16.2% and 12.9% relative mixed error rate reduction on the two test sets, respectively.
Towards Fine-Grained Prosody Control for Voice Conversion
Zheng Lian, J. Tao, Zhengqi Wen, Bin Liu, Yibin Zheng, Rongxiu Zhong
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). DOI: 10.1109/ISCSLP49672.2021.9362110

Abstract: In a typical voice conversion system, previous works utilize various acoustic features of the source speech (such as pitch, the voiced/unvoiced flag and aperiodicity) to control the prosody of the converted speech. However, prosody is related to many factors, such as intonation, stress and rhythm, and it is challenging to describe prosody perfectly through hand-crafted acoustic features. To address these difficulties, we propose to use prosody embeddings, learned from the source speech in an unsupervised manner, to describe prosody. To verify the effectiveness of the proposed method, we conduct experiments on our Mandarin corpus. Experimental results show that the proposed method improves the speech quality and speaker similarity of the converted speech. Moreover, we observe that the method achieves promising results even under singing conditions.
{"title":"Sams-Net: A Sliced Attention-based Neural Network for Music Source Separation","authors":"Tingle Li, Jiawei Chen, Haowen Hou, Ming Li","doi":"10.1109/ISCSLP49672.2021.9362081","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362081","url":null,"abstract":"Convolutional Neural Network (CNN) or Long Short-term Memory (LSTM) based models with the input of spectrogram or waveforms are commonly used for deep learning based audio source separation. In this paper, we propose a Sliced Attention-based neural network (Sams-Net) in the spectrogram domain for the music source separation task. It enables spectral feature interactions with multi-head attention mechanism, achieves easier parallel computing and has a larger receptive field com-pared with LSTMs and CNNs respectively. Experimental results on the MUSDB18 dataset show that the proposed method, with fewer parameters, outperforms most of the state-of-the-art DNN-based methods.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125206701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}