{"title":"利用说话者和性别信息改进无监督声学词嵌入","authors":"L. V. van Staden, H. Kamper","doi":"10.1109/SAUPEC/RobMech/PRASA48453.2020.9040986","DOIUrl":null,"url":null,"abstract":"For many languages, there is little or no labelled speech data available for training speech processing models. In zero-resource settings where unlabelled speech audio is the only available resource, speech applications for search, discovery and indexing often need to compare speech segments of different durations. Acoustic word embeddings are fixed dimensional representations of variable length speech sequences, allowing for efficient comparisons. Unsupervised acoustic word embedding models often still retain nuisance factors such as a speaker's identity and gender. Here we investigate how to improve the invariance of unsupervised acoustic embeddings to speaker and gender characteristics. We assume that speaker and gender labels are available for the untranscribed training data. We then consider two different methods for normalising out these factors: speaker and gender conditioning, and adversarial training. We apply both methods to two unsupervised embedding models: a recurrent neural network (RNN) autoencoder and a RNN correspondence autoencoder. In a word discrimination task, we find little benefit by explicitly normalising the embeddings to speaker and gender on English data. But on Xitsonga, substantial improvements are achieved. We speculate that this is due to the higher number of speakers present in the unlabelled Xitsonga training data.","PeriodicalId":215514,"journal":{"name":"2020 International SAUPEC/RobMech/PRASA Conference","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information\",\"authors\":\"L. V. van Staden, H. Kamper\",\"doi\":\"10.1109/SAUPEC/RobMech/PRASA48453.2020.9040986\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For many languages, there is little or no labelled speech data available for training speech processing models. In zero-resource settings where unlabelled speech audio is the only available resource, speech applications for search, discovery and indexing often need to compare speech segments of different durations. Acoustic word embeddings are fixed dimensional representations of variable length speech sequences, allowing for efficient comparisons. Unsupervised acoustic word embedding models often still retain nuisance factors such as a speaker's identity and gender. Here we investigate how to improve the invariance of unsupervised acoustic embeddings to speaker and gender characteristics. We assume that speaker and gender labels are available for the untranscribed training data. We then consider two different methods for normalising out these factors: speaker and gender conditioning, and adversarial training. We apply both methods to two unsupervised embedding models: a recurrent neural network (RNN) autoencoder and a RNN correspondence autoencoder. In a word discrimination task, we find little benefit by explicitly normalising the embeddings to speaker and gender on English data. But on Xitsonga, substantial improvements are achieved. 
We speculate that this is due to the higher number of speakers present in the unlabelled Xitsonga training data.\",\"PeriodicalId\":215514,\"journal\":{\"name\":\"2020 International SAUPEC/RobMech/PRASA Conference\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 International SAUPEC/RobMech/PRASA Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SAUPEC/RobMech/PRASA48453.2020.9040986\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 International SAUPEC/RobMech/PRASA Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SAUPEC/RobMech/PRASA48453.2020.9040986","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving Unsupervised Acoustic Word Embeddings using Speaker and Gender Information
For many languages, little or no labelled speech data is available for training speech processing models. In zero-resource settings, where unlabelled speech audio is the only available resource, speech applications for search, discovery and indexing often need to compare speech segments of different durations. Acoustic word embeddings are fixed-dimensional representations of variable-length speech sequences, allowing for efficient comparisons. Unsupervised acoustic word embedding models often still retain nuisance factors such as a speaker's identity and gender. Here we investigate how to improve the invariance of unsupervised acoustic embeddings to speaker and gender characteristics. We assume that speaker and gender labels are available for the untranscribed training data. We then consider two different methods for normalising out these factors: speaker and gender conditioning, and adversarial training. We apply both methods to two unsupervised embedding models: a recurrent neural network (RNN) autoencoder and an RNN correspondence autoencoder. In a word discrimination task, we find little benefit from explicitly normalising the embeddings for speaker and gender on English data, but on Xitsonga substantial improvements are achieved. We speculate that this is due to the larger number of speakers present in the unlabelled Xitsonga training data.
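To make the two normalisation methods concrete, here is a minimal PyTorch sketch, not the authors' code, of an RNN autoencoder whose decoder is conditioned on a learned speaker embedding, with an adversarial speaker classifier attached to the bottleneck through a gradient-reversal layer. All names, dimensions, and hyperparameters (ConditionedAE, feat_dim=13, adv_lambda, the speaker count) are illustrative assumptions; gender conditioning would work the same way with a second label table.

```python
# Sketch (assumed, not the paper's implementation) of speaker conditioning
# and adversarial training on an RNN autoencoder for acoustic word embeddings.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reverses and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class ConditionedAE(nn.Module):
    """RNN autoencoder with a speaker-conditioned decoder and an
    adversarial speaker classifier on the embedding (both hypothetical)."""
    def __init__(self, feat_dim=13, hidden=256, embed_dim=130,
                 n_speakers=24, spk_dim=16, adv_lambda=1.0):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_embed = nn.Linear(hidden, embed_dim)
        # Conditioning: the decoder also sees a learned speaker vector,
        # so the embedding itself need not carry speaker identity.
        self.spk_table = nn.Embedding(n_speakers, spk_dim)
        self.decoder = nn.GRU(embed_dim + spk_dim, hidden, batch_first=True)
        self.to_feats = nn.Linear(hidden, feat_dim)
        # Adversary: predicts the speaker from the embedding; gradient
        # reversal pushes the encoder to strip that information out.
        self.adversary = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_speakers))
        self.adv_lambda = adv_lambda

    def forward(self, feats, speaker_ids):
        # feats: (batch, frames, feat_dim); speaker_ids: (batch,)
        _, h = self.encoder(feats)
        z = self.to_embed(h[-1])              # fixed-dimensional embedding
        spk = self.spk_table(speaker_ids)     # conditioning vector
        T = feats.size(1)
        dec_in = torch.cat([z, spk], dim=-1).unsqueeze(1).repeat(1, T, 1)
        out, _ = self.decoder(dec_in)
        recon = self.to_feats(out)
        spk_logits = self.adversary(GradReverse.apply(z, self.adv_lambda))
        return z, recon, spk_logits


# One illustrative training step: reconstruction loss plus adversarial loss.
model = ConditionedAE()
feats = torch.randn(8, 50, 13)                # 8 segments, 50 frames of MFCCs
speakers = torch.randint(0, 24, (8,))
z, recon, spk_logits = model(feats, speakers)
loss = nn.functional.mse_loss(recon, feats) \
     + nn.functional.cross_entropy(spk_logits, speakers)
loss.backward()
```

The conditioning path hands the decoder the speaker identity directly, freeing the bottleneck embedding z from encoding it, while the reversal layer penalises the encoder whenever the adversary can still recover the speaker from z. The paper evaluates conditioning and adversarial training as separate methods; combining them in one module here is only for compactness.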