Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation

IF 5.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-08-15 DOI:10.1109/TASLP.2024.3444470

Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro

{"title":"Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation","authors":"Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro","doi":"10.1109/TASLP.2024.3444470","DOIUrl":null,"url":null,"abstract":"This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during the training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that the UTUT can also perform language translations for novel language pairs that are not present during training as pairs, which has not well been explored in the previous literature.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3934-3946"},"PeriodicalIF":5.1000,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10637752/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during the training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that the UTUT can also perform language translations for novel language pairs that are not present during training as pairs, which has not well been explored in the previous literature.

查看原文本刊更多论文

用于多对多多语言语音到语音翻译的无文本单元到单元训练

本文提出了一种用于多对多多语种语音到语音翻译的无文本训练方法，这种方法也有利于将预先训练好的知识转移到基于文本的系统、文本到语音合成和文本到语音翻译中。为此，我们用语音单元来表示多语言语音，这些语音单元是由自监督语音模型得出的语音特征的离散表示。通过将语音单元视为伪文本，我们可以将注意力集中在语音的语言内容上，这样就可以很容易地将语音和文本模式的语音信息联系起来。通过将学习问题的输入和输出都设置为语音单元，我们提出了在多对多口语翻译环境中训练编码器-解码器模型的方法，即单元对单元翻译（UTUT）。具体来说，编码器以源语言标记为条件，正确理解输入的口语，而解码器则以目标语言标记为条件，生成目标语言的翻译语音。因此，在训练过程中，模型可以建立有关语言理解方式以及如何将它们与不同语言联系起来的知识。由于语音单元可以很容易地通过量化和音素化分别从音频和文本中关联起来，因此即使是以无文本方式训练的模型，也可以很容易地转移到与文本相关的任务中。我们证明，所提出的 UTUT 模型不仅可以有效地用于语音到语音翻译（S2ST），还可以用于多语言文本到语音合成（T2S）和文本到语音翻译（T2ST），只需对文本输入进行最小限度的微调。通过开展涵盖各种语言的综合实验，我们验证了所提方法在各种多语言任务中的有效性。此外，得益于多对多的语言训练，我们证明了UTUT 还能对训练过程中不存在的新语言对进行语言翻译，而这在之前的文献中还没有得到很好的探讨。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.