Reducing Domain mismatch in Self-supervised speech pre-training

Interspeech Pub Date : 2022-09-18 DOI:10.21437/interspeech.2022-736

M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano

Pages: 3028-3032


Abstract

Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames that are randomly masked within an utterance. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning because not all samples carry information relevant to learning meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach that focuses on specific samples during MSM pre-training. ATM employs an external ASR model, or scorer, to weight unsupervised input samples by performing fine-grained data selection. ATM performs masking over the highly confident input frames as chosen by the scorer, which allows the model to learn meaningful representations. We conduct fine-tuning experiments on well-benchmarked corpora: LibriSpeech (matching the pre-training data), and AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.
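To make the contrast between random masking and scorer-guided masking concrete, the following is a minimal sketch, not the authors' implementation. It assumes the scorer yields one confidence value per frame and that "highly confident" means the top fraction of frames by score; the function names, the top-k rule, and the example scores are all hypothetical.

```python
import random


def random_mask(num_frames, mask_fraction, seed=0):
    """Baseline MSM-style choice: pick frames to mask uniformly at random."""
    rng = random.Random(seed)
    k = int(num_frames * mask_fraction)
    return sorted(rng.sample(range(num_frames), k))


def confidence_mask(confidences, mask_fraction):
    """Pick the most confident frames to mask.

    Assumption: a simple top-k rule over per-frame scorer confidences.
    The abstract does not specify the exact selection rule.
    """
    k = int(len(confidences) * mask_fraction)
    # Rank frame indices by descending confidence; sorted() is stable,
    # so tied frames keep their original (earlier-first) order.
    ranked = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    return sorted(ranked[:k])


# Made-up per-frame confidences for a 10-frame utterance.
scores = [0.91, 0.15, 0.88, 0.40, 0.97, 0.05, 0.76, 0.33, 0.60, 0.82]

masked = confidence_mask(scores, mask_fraction=0.4)
print(masked)  # [0, 2, 4, 9]
```

In this toy setting, the confidence-guided variant always masks the four frames the scorer trusts most, whereas `random_mask` would select any four frames with equal probability regardless of how informative they are.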
