Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017

Michael Heck, S. Sakti, Satoshi Nakamura
{"title":"无监督子词建模的特征优化DPGMM聚类:对zerospeech 2017的贡献","authors":"Michael Heck, S. Sakti, Satoshi Nakamura","doi":"10.1109/ASRU.2017.8269011","DOIUrl":null,"url":null,"abstract":"This paper describes our unsupervised subword modeling pipeline for the zero resource speech challenge (ZeroSpeech) 2017. Our approach is built around the Dirichlet process Gaussian mixture model (DPGMM) that we use to cluster speech feature vectors into a dynamically sized set of classes. By considering each class an acoustic unit, speech can be represented as sequence of class posteriorgrams. We enhance this method by automatically optimizing the DPGMM sampler's input features in a multi-stage clustering framework, where we unsupervisedly learn transformations using LDA, MLLT and (basis) fMLLR to reduce variance in the features. We show that this optimization considerably boosts the subword modeling quality, according to the performance on the ABX phone discriminability task. For the first time, we apply inferred subword models to previously unseen data from a new set of speakers. We demonstrate our method's good generalization and the effectiveness of its blind speaker adaptation in extensive experiments on a multitude of datasets. Our pipeline has very little need for hyper-parameter adjustment and is entirely unsupervised, i.e., it only takes raw audio recordings as input, without requiring any pre-defined segmentation, explicit speaker IDs or other meta data.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"6 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":"{\"title\":\"Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to zerospeech 2017\",\"authors\":\"Michael Heck, S. Sakti, Satoshi Nakamura\",\"doi\":\"10.1109/ASRU.2017.8269011\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes our unsupervised subword modeling pipeline for the zero resource speech challenge (ZeroSpeech) 2017. Our approach is built around the Dirichlet process Gaussian mixture model (DPGMM) that we use to cluster speech feature vectors into a dynamically sized set of classes. By considering each class an acoustic unit, speech can be represented as sequence of class posteriorgrams. We enhance this method by automatically optimizing the DPGMM sampler's input features in a multi-stage clustering framework, where we unsupervisedly learn transformations using LDA, MLLT and (basis) fMLLR to reduce variance in the features. We show that this optimization considerably boosts the subword modeling quality, according to the performance on the ABX phone discriminability task. For the first time, we apply inferred subword models to previously unseen data from a new set of speakers. We demonstrate our method's good generalization and the effectiveness of its blind speaker adaptation in extensive experiments on a multitude of datasets. 
Our pipeline has very little need for hyper-parameter adjustment and is entirely unsupervised, i.e., it only takes raw audio recordings as input, without requiring any pre-defined segmentation, explicit speaker IDs or other meta data.\",\"PeriodicalId\":290868,\"journal\":{\"name\":\"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"volume\":\"6 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"49\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ASRU.2017.8269011\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8269011","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 49

Abstract

This paper describes our unsupervised subword modeling pipeline for the zero resource speech challenge (ZeroSpeech) 2017. Our approach is built around the Dirichlet process Gaussian mixture model (DPGMM), which we use to cluster speech feature vectors into a dynamically sized set of classes. By considering each class an acoustic unit, speech can be represented as a sequence of class posteriorgrams. We enhance this method by automatically optimizing the DPGMM sampler's input features in a multi-stage clustering framework, in which we learn transformations without supervision using LDA, MLLT and (basis) fMLLR to reduce variance in the features. We show that this optimization considerably boosts subword modeling quality, as measured by performance on the ABX phone discriminability task. For the first time, we apply the inferred subword models to previously unseen data from a new set of speakers. We demonstrate the method's good generalization and the effectiveness of its blind speaker adaptation in extensive experiments on a multitude of datasets. Our pipeline needs very little hyper-parameter adjustment and is entirely unsupervised, i.e., it takes only raw audio recordings as input, without requiring any pre-defined segmentation, explicit speaker IDs or other metadata.
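To make the clustering step concrete, the following is a minimal sketch of DPGMM-style clustering of frame-level features and conversion of an utterance into a posteriorgram sequence. It is not the authors' implementation: scikit-learn's BayesianGaussianMixture with a Dirichlet process prior stands in for the paper's DPGMM Gibbs sampler, the feature matrix is a placeholder for real MFCC-like vectors, and the function names and truncation level are illustrative assumptions.

```python
# Sketch only: variational DP-GMM stand-in for the paper's DPGMM sampler.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_subword_clusters(features, max_components=100, seed=0):
    """Fit a DP-like Gaussian mixture on pooled frame-level features.

    features: (n_frames, n_dims) array of acoustic feature vectors.
    max_components is a truncation level, not a fixed class count: the
    Dirichlet process prior lets the model use fewer effective classes.
    """
    model = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=200,
        random_state=seed,
    )
    model.fit(features)
    return model

def posteriorgrams(model, utterance_features):
    """Per-frame class posterior probabilities for one utterance."""
    return model.predict_proba(utterance_features)  # (n_frames, max_components)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pooled = rng.standard_normal((5000, 39))   # placeholder for MFCC + deltas
    gmm = fit_subword_clusters(pooled)
    post = posteriorgrams(gmm, pooled[:200])
    print(post.shape)                          # (200, 100)
```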
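The multi-stage feature optimization can be sketched in the same spirit: DPGMM cluster assignments serve as pseudo-class labels for estimating supervised-style transforms without any true supervision. The sketch below shows only the LDA stage using scikit-learn; MLLT and (basis) fMLLR, which the paper estimates on top of LDA, are omitted, and the helper name lda_from_cluster_labels is hypothetical. In the paper's setting, LDA is typically applied to spliced frame contexts rather than single frames.

```python
# Sketch only: LDA transform estimated from DPGMM pseudo-labels.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_from_cluster_labels(features, labels, target_dim=40):
    """Estimate an LDA projection with DPGMM labels as pseudo classes."""
    n_classes = len(np.unique(labels))
    n_components = min(target_dim, n_classes - 1, features.shape[1])
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(features, labels)
    return lda

# Usage, continuing from the previous sketch:
# labels = gmm.predict(pooled)                  # hard cluster assignments
# lda = lda_from_cluster_labels(pooled, labels)
# optimized = lda.transform(pooled)             # re-cluster these features next
```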