Hongjie Chen, C. Leung, Lei Xie, B. Ma, Haizhou Li
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), published 2017-12-01. DOI: 10.1109/ASRU.2017.8269009 (https://doi.org/10.1109/ASRU.2017.8269009)
Multilingual bottle-neck feature learning from untranscribed speech
We propose to learn a low-dimensional feature representation for multiple languages without access to manual transcriptions. The multilingual features are extracted from the shared bottleneck layer of a multi-task learning deep neural network, which is trained on unsupervised phoneme-like labels. These labels are obtained from language-dependent Dirichlet process Gaussian mixture models (DPGMMs). Vocal tract length normalization (VTLN) is applied to the mel-frequency cepstral coefficients to reduce talker variation when the DPGMMs are trained. The proposed features are evaluated with the ABX phoneme discriminability test of the Zero Resource Speech Challenge 2017. In our experiments, the proposed features perform well across different languages and consistently outperform our previously proposed DPGMM posteriorgrams, which achieved the best performance in the same challenge in 2015.
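To make the ABX phoneme discriminability test more concrete, the following is a minimal sketch of the idea, not the challenge's official evaluation code. Given tokens A and X from the same phoneme category and a token B from a different one, a trial counts as an error when X is closer to B than to A. The sketch assumes a cosine frame distance averaged along a dynamic time warping (DTW) path, which is a common choice for this kind of test; the function names, the tie handling, and the toy two-dimensional "features" are all illustrative assumptions and do not come from the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero frame vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(x, y):
    """Accumulated cost of the minimum-cost DTW path, divided by its length."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = (accumulated distance, path length) for aligning x[:i] with y[:j]
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(x[i - 1], y[j - 1])
            # Tuples compare by accumulated distance first, then by path length.
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (best[0] + d, best[1] + 1)
    total, length = cost[n][m]
    return total / length


def abx_error_rate(cat_a, cat_b):
    """ABX error rate with A and X drawn from cat_a and B drawn from cat_b.

    Each category is a list of tokens; each token is a list of frame vectors.
    Ties count as half an error (an assumption of this sketch).
    """
    errors, trials = 0.0, 0
    for ia, a in enumerate(cat_a):
        for ix, x in enumerate(cat_a):
            if ia == ix:
                continue  # A and X must be different tokens
            for b in cat_b:
                d_ax = dtw_distance(a, x)
                d_bx = dtw_distance(b, x)
                if d_ax > d_bx:
                    errors += 1.0
                elif d_ax == d_bx:
                    errors += 0.5
                trials += 1
    return errors / trials


# Toy features: category A frames point roughly along the first axis,
# category B frames along the second. Real inputs would be bottleneck features.
cat_a = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[1.0, 0.0], [0.8, 0.1], [0.9, 0.1]],
]
cat_b = [
    [[0.1, 1.0], [0.2, 0.9]],
]

print(abx_error_rate(cat_a, cat_b))  # prints 0.0
```

A lower error rate means the features separate the two categories better. A full evaluation would also swap the roles of the two categories and average over many category pairs and talkers, which this sketch leaves out.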