Composite embedding systems for ZeroSpeech2017 Track1

Hayato Shibata, Taku Kato, T. Shinozaki, Shinji Watanabe
Published in: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017-12-01
DOI: 10.1109/ASRU.2017.8269012 (https://doi.org/10.1109/ASRU.2017.8269012)
Citations: 16

Abstract

This paper investigates novel composite embedding systems for language-independent high-performance feature extraction using triphone-based DNN-HMM and character-based end-to-end speech recognition systems. The DNN-HMM is trained with phoneme transcripts based on a large-scale Japanese ASR recipe included in the Kaldi toolkit from the Corpus of Spontaneous Japanese (CSJ) with some modifications. The end-to-end ASR system is based on a hybrid architecture consisting of an attention-based encoder-decoder and connectionist temporal classification. This model is trained with multi-language speech data using character transcripts in a pure end-to-end fashion without requiring phonemic representation. Posterior features, PCA-transformed features, and bottleneck features are extracted from the two systems; then, various combinations of features are explored. Additionally, a bypassed autoencoder (bypassed AE) is proposed to normalize speaker characteristics in an unsupervised manner. An evaluation using the ABX test showed that the DNN-HMM-based CSJ bottleneck features resulted in a good performance regardless of the input language. The pre-activation vectors extracted from the multilingual end-to-end system with PCA provided a somewhat better performance than did the CSJ bottleneck features. The bypassed AE yielded an improved performance over a baseline AE. The lowest error rates were obtained by composite features that concatenated the end-to-end features with the CSJ bottleneck features.
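
The composite features described in the abstract concatenate PCA-transformed end-to-end features with the CSJ bottleneck features. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the feature dimensions, the number of retained components, and the random stand-in data are all hypothetical, and the sketch assumes both feature streams are already aligned to the same frame rate.

```python
import numpy as np


def pca_transform(features, n_components):
    """Project frame-level features (frames x dims) onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def composite_features(e2e_preact, bottleneck, n_components):
    """Concatenate PCA-reduced end-to-end features with bottleneck features per frame."""
    if e2e_preact.shape[0] != bottleneck.shape[0]:
        raise ValueError("both feature streams must have the same number of frames")
    reduced = pca_transform(e2e_preact, n_components)
    return np.concatenate([reduced, bottleneck], axis=1)


# Hypothetical shapes: 200 frames, 50-dim pre-activations, 40-dim bottleneck features.
rng = np.random.default_rng(0)
e2e_preact = rng.standard_normal((200, 50))
bottleneck = rng.standard_normal((200, 40))

composite = composite_features(e2e_preact, bottleneck, n_components=20)
print(composite.shape)  # (200, 60)
```

In practice the PCA basis would be estimated once on training data and reused for every utterance; fitting it on the same array it transforms, as done here, only keeps the sketch short.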