Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017
Michael Heck, S. Sakti, Satoshi Nakamura
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017-12-01
DOI: 10.1109/ASRU.2017.8269011 (https://doi.org/10.1109/ASRU.2017.8269011)
This paper describes our unsupervised subword modeling pipeline for the zero resource speech challenge (ZeroSpeech) 2017. Our approach is built around the Dirichlet process Gaussian mixture model (DPGMM), which we use to cluster speech feature vectors into a dynamically sized set of classes. By treating each class as an acoustic unit, speech can be represented as a sequence of class posteriorgrams. We enhance this method by automatically optimizing the DPGMM sampler's input features in a multi-stage clustering framework, in which we learn transformations without supervision using LDA, MLLT and (basis) fMLLR to reduce variance in the features. We show that this optimization considerably boosts subword modeling quality, as measured by performance on the ABX phone discriminability task. For the first time, we apply inferred subword models to previously unseen data from a new set of speakers. We demonstrate our method's good generalization and the effectiveness of its blind speaker adaptation in extensive experiments on a multitude of datasets. Our pipeline needs very little hyper-parameter adjustment and is entirely unsupervised, i.e., it takes only raw audio recordings as input, without requiring any pre-defined segmentation, explicit speaker IDs or other metadata.
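
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn's `BayesianGaussianMixture` (a truncated, variational approximation of a DPGMM) in place of the DPGMM sampler, synthetic vectors in place of speech features, and a single LDA pass in place of the full LDA, MLLT and fMLLR chain. The names `fit_dpgmm`, `posteriorgram` and `max_components` are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.mixture import BayesianGaussianMixture


def fit_dpgmm(features, max_components=20, seed=0):
    """Fit a truncated Dirichlet process GMM with variational inference.

    Assumption: this stands in for the DPGMM sampler named in the abstract.
    max_components is only an upper bound; unused components receive
    near-zero weight, so the effective number of classes is data-driven.
    """
    model = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=200,
        random_state=seed,
    )
    model.fit(features)
    return model


def posteriorgram(model, features):
    """Return per-frame class posteriors, shape (n_frames, n_components)."""
    return model.predict_proba(features)


# Synthetic stand-in for speech frames: three well-separated groups in 4-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [6.0, 6.0, 0.0, 0.0],
                    [-6.0, 6.0, 0.0, 0.0]])
frames = np.vstack([c + rng.normal(size=(200, 4)) for c in centers])

# Stage 1: cluster the raw features and read off posteriorgrams.
stage1 = fit_dpgmm(frames)
post1 = posteriorgram(stage1, frames)

# Stage 2: use the stage-1 class labels as targets for LDA, then re-cluster
# the transformed features. MLLT and fMLLR are omitted from this sketch.
labels = post1.argmax(axis=1)
n_classes = len(np.unique(labels))
lda = LinearDiscriminantAnalysis(
    n_components=min(n_classes - 1, frames.shape[1])
)
frames_lda = lda.fit_transform(frames, labels)
stage2 = fit_dpgmm(frames_lda)
post2 = posteriorgram(stage2, frames_lda)
```

Each row of `post1` or `post2` is a probability distribution over classes for one frame, so the sequence of rows for an utterance plays the role of the class posteriorgram representation described above.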