End-to-end text-independent speaker verification with flexibility in utterance duration

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2017-12-01. DOI: 10.1109/ASRU.2017.8268989

Chunlei Zhang, K. Koishida


Citations: 14

Abstract

We continue to investigate end-to-end text-independent speaker verification by incorporating the variability from different utterance durations. Our previous study [1] showed competitive performance with a triplet-loss-based end-to-end text-independent speaker verification system. To normalize the duration variability, we provided fixed-length inputs to the network through a simple cropping or padding operation. These operations do not seem ideal, particularly for long utterances, where some information is discarded, whereas an i-vector system typically becomes more accurate as input duration increases. In this study, we propose to replace the final max/average pooling layer of the Inception-Resnet-v1 architecture with a Spatial Pyramid Pooling layer, which allows us to relax the fixed-length input constraint and train the entire network end to end with inputs of arbitrary size. In this way, the modified network can map variable-length utterances to fixed-length embeddings. Experiments show that the new end-to-end system with variable-size input reduces EER by a relative 8.4% over the end-to-end system with fixed-length input, and by a relative 24.0% over the i-vector/PLDA baseline system.
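To make concrete how a pyramid pooling layer can turn inputs of different lengths into a vector of one fixed size, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a simplified one-dimensional case in which the feature map is a list of frames (time steps), each holding the same number of channel values, and it pools over the time axis only. The function name `spatial_pyramid_pool` and the pyramid levels `(1, 2, 4)` are illustrative assumptions, not taken from the paper.

```python
def spatial_pyramid_pool(frames, levels=(1, 2, 4)):
    """Max-pool a (T x C) feature map into a vector of length C * sum(levels).

    frames: list of T frames, each a list of C channel values (T >= 1).
    levels: number of time bins at each pyramid level (an assumed setting).
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    num_channels = len(frames[0])
    pooled = []
    for bins in levels:
        for i in range(bins):
            # Bin i covers [floor(i*T/bins), ceil((i+1)*T/bins)), which is
            # never empty, even when T is smaller than the number of bins.
            start = (i * num_frames) // bins
            end = ((i + 1) * num_frames + bins - 1) // bins
            window = frames[start:end]
            for c in range(num_channels):
                pooled.append(max(frame[c] for frame in window))
    return pooled


# Two hypothetical "utterances" of different lengths, 3 channels each.
short = [[float(t + c) for c in range(3)] for t in range(5)]
long = [[float(t * c) for c in range(3)] for t in range(40)]
print(len(spatial_pyramid_pool(short)), len(spatial_pyramid_pool(long)))
# prints "21 21"
```

Both inputs yield 21 values (3 channels times 1 + 2 + 4 bins), so the output size depends only on the channel count and the pyramid levels, not on the number of frames. That property is what lets a network accept utterances of arbitrary duration without cropping or padding.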


