Supervised Acoustic Embeddings And Their Transferability Across Languages

Sreepratha Ram, Hanan Aldarmaki
{"title":"监督声嵌入及其跨语言可移植性","authors":"Sreepratha Ram, Hanan Aldarmaki","doi":"10.48550/arXiv.2301.01020","DOIUrl":null,"url":null,"abstract":"In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise, which is challenging in low-resource settings. Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition, including frame-level feature representations and Acoustic Word Embeddings (AWE) for variable-length segments. However, self-supervised models alone cannot learn perfect separation of the linguistic content as they are trained to optimize indirect objectives. In this work, we experiment with different pre-trained self-supervised features as input to AWE models and show that they work best within a supervised framework. Models trained on English can be transferred to other languages with no adaptation and outperform self-supervised models trained solely on the target languages.","PeriodicalId":405017,"journal":{"name":"International Conference on Natural Language and Speech Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Supervised Acoustic Embeddings And Their Transferability Across Languages\",\"authors\":\"Sreepratha Ram, Hanan Aldarmaki\",\"doi\":\"10.48550/arXiv.2301.01020\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise, which is challenging in low-resource settings. Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition, including frame-level feature representations and Acoustic Word Embeddings (AWE) for variable-length segments. However, self-supervised models alone cannot learn perfect separation of the linguistic content as they are trained to optimize indirect objectives. In this work, we experiment with different pre-trained self-supervised features as input to AWE models and show that they work best within a supervised framework. 
Models trained on English can be transferred to other languages with no adaptation and outperform self-supervised models trained solely on the target languages.\",\"PeriodicalId\":405017,\"journal\":{\"name\":\"International Conference on Natural Language and Speech Processing\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Natural Language and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2301.01020\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Natural Language and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.01020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise, which is challenging in low-resource settings. Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition, including frame-level feature representations and Acoustic Word Embeddings (AWE) for variable-length segments. However, self-supervised models alone cannot learn perfect separation of the linguistic content as they are trained to optimize indirect objectives. In this work, we experiment with different pre-trained self-supervised features as input to AWE models and show that they work best within a supervised framework. Models trained on English can be transferred to other languages with no adaptation and outperform self-supervised models trained solely on the target languages.
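
The abstract does not spell out the AWE architecture, so the following is only a minimal sketch of the general recipe it describes: frozen self-supervised frame-level features fed into a supervised acoustic word embedding model trained on word labels. The specific choices here (torchaudio's WAV2VEC2_BASE bundle as the self-supervised extractor, a bidirectional GRU pooling encoder, a triplet-margin objective) are illustrative assumptions, not the authors' implementation.

# Sketch: self-supervised frame features -> supervised AWE trained with a triplet loss.
# All architectural choices below are assumptions for illustration.
import torch
import torch.nn as nn
import torchaudio

# Pre-trained self-supervised feature extractor (wav2vec 2.0 base), kept frozen.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

class AWEEncoder(nn.Module):
    """Pools variable-length frame features into a fixed-size acoustic word embedding."""
    def __init__(self, feat_dim=768, hidden=256, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)

    def forward(self, frames):                      # frames: (batch, time, feat_dim)
        _, h = self.rnn(frames)                     # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)         # concatenate both directions
        return nn.functional.normalize(self.proj(h), dim=-1)

encoder = AWEEncoder()
triplet = nn.TripletMarginLoss(margin=0.4)

def embed(waveform):
    """waveform: (1, samples) at bundle.sample_rate; returns a (1, emb_dim) embedding."""
    with torch.no_grad():                           # SSL features are not fine-tuned
        feats, _ = ssl_model.extract_features(waveform)
    return encoder(feats[-1])                       # pool the last transformer layer

# Supervised training step: anchor and positive are two spoken instances of the same
# word, negative is a different word; the word labels provide the supervision.
# loss = triplet(embed(anchor_wav), embed(pos_wav), embed(neg_wav)); loss.backward()

For cross-lingual transfer as described in the abstract, an encoder of this kind trained on English word pairs would simply be applied to segments from the target language without any adaptation.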