Speech Augmentation Using Conditional Generative Adversarial Nets in Mongolian Speech Recognition
Authors: Zhiqiang Ma, Jinyi Li, Junpeng Zhang
Published in: 2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT), 2022-12-09
DOI: 10.1109/ACAIT56212.2022.10137828 (https://doi.org/10.1109/ACAIT56212.2022.10137828)
The lack of labeled data in the Mongolian speech data set leaves speech from different regions unevenly represented. To address this problem, this paper proposes a Mongolian speech data augmentation model based on a conditional generative adversarial network. The model trains a conditional speech generator against multiple fusion discriminators, and takes Mongolian text and specified regional features as input to generate Mongolian speech with those regional features. For comparison, the original data set was also augmented with speech rate perturbation and with spectrogram enhancement, and end-to-end Mongolian speech recognition models were trained on the original data set and on each augmented data set. The model trained on the data set augmented with specified regional features reached a word error rate of 3.1%. Compared with the models trained on the original data set, the speech rate perturbation data set, and the spectrogram enhancement data set, its word error rate was lower by 2%, 0.5%, and 0.8%, respectively.
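The comparison above is reported as word error rate (WER). As a reminder of how that metric is usually defined, here is a minimal sketch that assumes whitespace-separated words and the standard Levenshtein edit distance. The function name `word_error_rate` and the placeholder tokens are illustrative assumptions; the abstract does not describe the scoring tool the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are separated by whitespace. This is an illustrative
    scorer, not the evaluation code used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder tokens: one substitution (b -> x) and one deletion (d)
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition a WER of 3.1% corresponds to roughly 3 word errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.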