使用深度神经结构学习讲话者的特定特征。

IEEE transactions on neural networks Pub Date : 2011-11-01 Epub Date: 2011-09-26 DOI:10.1109/TNN.2011.2167240
Ke Chen, Ahmad Salman
{"title":"使用深度神经结构学习讲话者的特定特征。","authors":"Ke Chen,&nbsp;Ahmad Salman","doi":"10.1109/TNN.2011.2167240","DOIUrl":null,"url":null,"abstract":"<p><p>Speech signals convey various yet mixed information ranging from linguistic to speaker-specific information. However, most of acoustic representations characterize all different kinds of information as whole, which could hinder either a speech or a speaker recognition (SR) system from producing a better performance. In this paper, we propose a novel deep neural architecture (DNA) especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we come up with an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural networks: i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether their utterances have been used in training our DNA, and highly insensitive to text and languages spoken. Extensive comparative studies suggest that our approach yields favorite results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.</p>","PeriodicalId":13434,"journal":{"name":"IEEE transactions on neural networks","volume":"22 11","pages":"1744-56"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TNN.2011.2167240","citationCount":"114","resultStr":"{\"title\":\"Learning speaker-specific characteristics with a deep neural architecture.\",\"authors\":\"Ke Chen,&nbsp;Ahmad Salman\",\"doi\":\"10.1109/TNN.2011.2167240\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Speech signals convey various yet mixed information ranging from linguistic to speaker-specific information. However, most of acoustic representations characterize all different kinds of information as whole, which could hinder either a speech or a speaker recognition (SR) system from producing a better performance. In this paper, we propose a novel deep neural architecture (DNA) especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we come up with an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural networks: i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether their utterances have been used in training our DNA, and highly insensitive to text and languages spoken. Extensive comparative studies suggest that our approach yields favorite results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.</p>\",\"PeriodicalId\":13434,\"journal\":{\"name\":\"IEEE transactions on neural networks\",\"volume\":\"22 11\",\"pages\":\"1744-56\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/TNN.2011.2167240\",\"citationCount\":\"114\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TNN.2011.2167240\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2011/9/26 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TNN.2011.2167240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2011/9/26 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 114

摘要

语音信号传递各种各样的混合信息,从语言信息到说话人特定的信息。然而,大多数声学表示将所有不同类型的信息作为一个整体来描述,这可能会阻碍语音或说话人识别(SR)系统产生更好的性能。在本文中,我们提出了一种新的深度神经结构(DNA),特别是用于从梅尔频率倒谱系数中学习说话人特定特征,这是语音识别和SR中常用的声学表示,导致说话人特定的过完全表示。为了学习说话人固有的特征,我们提出了一个由说话人相似/不相似度的对比损失和数据重建损失组成的目标函数,用于正则化非说话人相关信息的干扰。此外,我们采用了一种混合学习策略来学习深度神经网络的参数:即初始化的局部贪婪分层无监督预训练和最终判别目标的全局监督学习。通过四个语言数据联盟(LDC)基准和两个非英语语料库,我们证明了我们的过完备表示在描述不同说话者的特征方面是稳健的,无论他们的话语是否被用于训练我们的DNA,并且对所讲的文本和语言高度不敏感。大量的比较研究表明,我们的方法在说话人验证和分割方面取得了令人满意的结果。最后,我们讨论了与我们提出的方法有关的几个问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Learning speaker-specific characteristics with a deep neural architecture.

Speech signals convey various yet mixed information ranging from linguistic to speaker-specific information. However, most of acoustic representations characterize all different kinds of information as whole, which could hinder either a speech or a speaker recognition (SR) system from producing a better performance. In this paper, we propose a novel deep neural architecture (DNA) especially for learning speaker-specific characteristics from mel-frequency cepstral coefficients, an acoustic representation commonly used in both speech recognition and SR, which results in a speaker-specific overcomplete representation. In order to learn intrinsic speaker-specific characteristics, we come up with an objective function consisting of contrastive losses in terms of speaker similarity/dissimilarity and data reconstruction losses used as regularization to normalize the interference of non-speaker-related information. Moreover, we employ a hybrid learning strategy for learning parameters of the deep neural networks: i.e., local yet greedy layerwise unsupervised pretraining for initialization and global supervised learning for the ultimate discriminative goal. With four Linguistic Data Consortium (LDC) benchmarks and two non-English corpora, we demonstrate that our overcomplete representation is robust in characterizing various speakers, no matter whether their utterances have been used in training our DNA, and highly insensitive to text and languages spoken. Extensive comparative studies suggest that our approach yields favorite results in speaker verification and segmentation. Finally, we discuss several issues concerning our proposed approach.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
IEEE transactions on neural networks
IEEE transactions on neural networks 工程技术-工程:电子与电气
自引率
0.00%
发文量
2
审稿时长
8.7 months
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信