{"title":"End-to-end text-independent speaker verification with flexibility in utterance duration","authors":"Chunlei Zhang, K. Koishida","doi":"10.1109/ASRU.2017.8268989","DOIUrl":null,"url":null,"abstract":"We continue to investigate end-to-end text-independent speaker verification by incorporating the variability from different utterance durations. Our previous study [1] showed a competitive performance with a triplet loss based end-to-end text-independent speaker verification system. To normalize the duration variability, we provided fixed length inputs to the network by a simple cropping or padding operation. Those operations do not seem ideal, particularly for long duration where some amount of information is discarded, while an i-vector system typically has improved accuracy with an increase in input duration. In this study, we propose to replace the final max/average pooling layer with a Spatial Pyramid Pooling layer in the Inception-Resnet-v1 architecture, which allows us to relax the fixed-length input constraint and train the entire network with the arbitrary size of input in an end-to-end fashion. In this way, the modified network can map variable length utterances into fixed length embeddings. Experiments shows that the new end-to-end system with variable size input relatively reduces EER by 8.4% over the end-to-end system with fixed-length input, and 24.0% over the i-vector/PLDA baseline system. an end-to-end system with.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ASRU.2017.8268989","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
We continue to investigate end-to-end text-independent speaker verification by incorporating the variability from different utterance durations. Our previous study [1] showed a competitive performance with a triplet loss based end-to-end text-independent speaker verification system. To normalize the duration variability, we provided fixed length inputs to the network by a simple cropping or padding operation. Those operations do not seem ideal, particularly for long duration where some amount of information is discarded, while an i-vector system typically has improved accuracy with an increase in input duration. In this study, we propose to replace the final max/average pooling layer with a Spatial Pyramid Pooling layer in the Inception-Resnet-v1 architecture, which allows us to relax the fixed-length input constraint and train the entire network with the arbitrary size of input in an end-to-end fashion. In this way, the modified network can map variable length utterances into fixed length embeddings. Experiments shows that the new end-to-end system with variable size input relatively reduces EER by 8.4% over the end-to-end system with fixed-length input, and 24.0% over the i-vector/PLDA baseline system. an end-to-end system with.