Tight Integrated End-to-End Training for Cascaded Speech Translation

2021 IEEE Spoken Language Technology Workshop (SLT), Pub Date: 2020-11-24, DOI: 10.1109/SLT48900.2021.9383462

Parnia Bahar, Tobias Bieschke, R. Schlüter, H. Ney


Citations: 20

Abstract

A cascaded speech translation model relies on a discrete, non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation from source speech to target text. Such modeling suffers from error propagation between the ASR and MT models. Direct speech translation is an alternative that avoids error propagation; however, its performance often lags behind that of the cascade system. To use an intermediate representation while preserving end-to-end trainability, previous studies have proposed two-stage models that pass the hidden vectors of the recognizer into the decoder of the MT model and ignore the MT encoder. This work explores the feasibility of collapsing all cascade components into a single end-to-end trainable model by optimizing all parameters of the ASR and MT models jointly, without ignoring any learned parameters. It is a tightly integrated method that passes renormalized source word posterior distributions as a soft decision instead of one-hot vectors, which enables backpropagation. As a result, it provides both transcriptions and translations and achieves strong consistency between them. Our experiments on four tasks with different data scenarios show that the model outperforms cascade models by up to 1.8% in BLEU and 2.0% in TER and is superior to direct models.
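The soft decision described in the abstract can be illustrated with a small, hypothetical sketch. It is not the authors' implementation: the function names, the toy vocabulary, and the sharpening exponent `gamma` are assumptions introduced here for illustration, since the abstract only states that renormalized source word posteriors are passed on instead of one-hot vectors. The idea shown is that the MT side receives an expected embedding under the posterior, which is a smooth function of the ASR output, rather than the embedding of a single argmax word.

```python
def renormalize(posterior, gamma=1.0):
    """Raise each probability to gamma and rescale so the result sums to 1.

    gamma is an assumed sharpening exponent: gamma = 1 leaves the posterior
    unchanged, larger values push it toward a one-hot vector.
    """
    powered = [p ** gamma for p in posterior]
    total = sum(powered)
    return [p / total for p in powered]


def soft_embedding(posterior, embedding_table):
    """Expected embedding under the posterior: sum over words of p(w) * E[w]."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(posterior, embedding_table))
        for d in range(dim)
    ]


def hard_embedding(posterior, embedding_table):
    """One-hot (argmax) decision, as in a conventional cascade."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return list(embedding_table[best])


# Toy example: a 3-word source vocabulary with 2-dimensional embeddings.
posterior = [0.6, 0.3, 0.1]                    # ASR word posterior at one position
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical MT source embeddings

soft = soft_embedding(renormalize(posterior, gamma=1.0), table)
sharp = soft_embedding(renormalize(posterior, gamma=20.0), table)
hard = hard_embedding(posterior, table)
```

With `gamma=1.0` the MT input mixes all three embeddings in proportion to the posterior; with a large `gamma` it approaches the hard argmax embedding. Because the soft path uses only multiplications and sums, gradients from the translation loss could flow back into the recognizer, which the argmax in `hard_embedding` would block.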
