Decoupled structure for improved adaptability of end-to-end models

IF 3 | Zone 3 (Computer Science) | JCR Q2 (ACOUSTICS)

Speech Communication Pub Date : 2024-07-23 DOI:10.1016/j.specom.2024.103109

Keqi Deng, Philip C. Woodland

Volume 163, Article 103109. Publisher page: https://www.sciencedirect.com/science/article/pii/S0167639324000803

Citations: 0

Abstract

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.
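The replaceable-LM idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`UnigramLM`, `DecoupledDecoder`), the method `replace_lm`, the three-token vocabulary, and in particular the choice to combine the acoustic and linguistic components by summing their log-scores are all assumptions made for illustration. The abstract only states that the two parts are decoupled so that the internal LM can be replaced at inference time.

```python
import math
from typing import List, Sequence

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


class UnigramLM:
    """Hypothetical stand-in for a language model.

    It ignores the token history and always returns the same
    log-probabilities; a real internal LM would condition on history.
    """

    def __init__(self, logits: Sequence[float]) -> None:
        self.logits = list(logits)

    def scores(self, history: Sequence[str]) -> List[float]:
        return log_softmax(self.logits)


class DecoupledDecoder:
    """Toy decoder whose linguistic part is a separate, replaceable object."""

    def __init__(self, lm: UnigramLM) -> None:
        self.lm = lm

    def replace_lm(self, new_lm: UnigramLM) -> None:
        # Swap the linguistic component; nothing else is retrained.
        self.lm = new_lm

    def step(self, acoustic_logits: Sequence[float], history: Sequence[str]) -> List[float]:
        # ASSUMPTION: combine the two parts by adding log-scores, then renormalise.
        acoustic = log_softmax(acoustic_logits)
        linguistic = self.lm.scores(history)
        return log_softmax([a + l for a, l in zip(acoustic, linguistic)])


def top_token(log_probs: Sequence[float]) -> str:
    """Return the vocabulary item with the highest log-probability."""
    return VOCAB[max(range(len(log_probs)), key=lambda i: log_probs[i])]


# Acoustic evidence that cannot distinguish "a" from "b".
ambiguous_acoustics = [0.0, 0.0, -5.0]

source_lm = UnigramLM([2.0, 0.0, 0.0])  # source-domain text prefers "a"
target_lm = UnigramLM([0.0, 2.0, 0.0])  # target-domain text prefers "b"

decoder = DecoupledDecoder(source_lm)
before = top_token(decoder.step(ambiguous_acoustics, []))

decoder.replace_lm(target_lm)  # domain shift: swap only the LM
after = top_token(decoder.step(ambiguous_acoustics, []))

print(before, after)
```

With the acoustic scores held fixed, swapping only the LM object moves the top hypothesis from the source-preferred token to the target-preferred one, which is the behaviour the decoupled structure is designed to allow using text-only target-domain data.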


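Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Articles published

Review time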

19.2 weeks

Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.