面向野外语音特征的无监督学习

M. Rivière, Emmanuel Dupoux
{"title":"面向野外语音特征的无监督学习","authors":"M. Rivière, Emmanuel Dupoux","doi":"10.1109/SLT48900.2021.9383461","DOIUrl":null,"url":null,"abstract":"Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data \"in the wild\"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"307 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Towards Unsupervised Learning of Speech Features in the Wild\",\"authors\":\"M. Rivière, Emmanuel Dupoux\",\"doi\":\"10.1109/SLT48900.2021.9383461\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data \\\"in the wild\\\"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.\",\"PeriodicalId\":243211,\"journal\":{\"name\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"307 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT48900.2021.9383461\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT48900.2021.9383461","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

摘要

最近关于语音表示的无监督对比学习的研究已经显示出有希望的结果,但到目前为止,主要是应用于干净的、精心策划的语音数据集。它是否也可以用于“野外”未准备好的音频数据?在这里,我们探讨了在这种情况下的三个潜在问题:(i)非语音数据的存在,(ii)嘈杂或低质量的语音数据,以及(iii)说话人分布的不平衡。我们表明,在lib -light train集(它本身是一个相对干净的纯语音数据集)上,这些问题结合起来,相对于ABX分数,性能成本已经高达30%。我们表明,前两个问题可以通过数据过滤来缓解,语音活动检测选择语音片段,而使用干净数据训练的模型的困惑有助于丢弃整个文件。我们证明了第三个问题可以通过在模型的预测分支中学习一个说话人来缓解。我们表明,这些技术建立了更强大的语音特征,可以在低资源环境下转移到ASR任务中。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Towards Unsupervised Learning of Speech Features in the Wild
Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data "in the wild"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信