Chain-of-Thought Prompting for Speech Translation
Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
arXiv:2409.11538 (cs.CL), published 2024-09-17
Abstract
Large language models (LLMs) have demonstrated remarkable advancements in
language understanding and generation. Building on the success of text-based
LLMs, recent research has adapted these models to use speech embeddings for
prompting, resulting in Speech-LLM models that exhibit strong performance in
automatic speech recognition (ASR) and automatic speech translation (AST). In
this work, we propose a novel approach to leverage ASR transcripts as prompts
for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM
model consists of a speech encoder and a Megatron-T5 encoder-decoder LLM. By first
decoding speech to generate ASR transcripts and subsequently using these
transcripts together with the encoded speech as prompts, we guide the speech
translation in a two-step process akin to chain-of-thought (CoT) prompting.
Low-rank adaptation (LoRA) is used to adapt the T5 LLM and outperforms full
model fine-tuning.
Experimental results show that the proposed CoT prompting significantly
improves AST performance, achieving an average increase of 2.4 BLEU points
across six En→X and X→En AST tasks compared to speech prompting alone.
Additionally, our method outperforms a related CoT prediction approach, which
predicts a concatenated sequence of ASR and AST transcripts, by an average of
2 BLEU points.
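The two-step CoT prompting described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `SpeechLLM`, `encode_speech`, `generate`, and `cot_translate` are all hypothetical names, and the stub model returns canned strings in place of real ASR and AST decoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechLLM:
    """Toy stand-in for a speech encoder plus an encoder-decoder LLM."""

    def encode_speech(self, audio: List[float]) -> List[float]:
        # Stub: a real model would return frame-level speech embeddings.
        return [sum(audio) / max(len(audio), 1)]

    def generate(self, prompt: str, speech_emb: List[float]) -> str:
        # Stub decoder: keys the output on the task tag in the prompt.
        if "transcribe" in prompt:
            return "hello world"   # step 1: ASR transcript
        return "hallo welt"        # step 2: translation

def cot_translate(model: SpeechLLM, audio: List[float], tgt_lang: str) -> str:
    """Two-step CoT prompting: first ASR, then AST conditioned on the transcript."""
    speech_emb = model.encode_speech(audio)
    # Step 1: decode speech into an ASR transcript.
    transcript = model.generate("transcribe:", speech_emb)
    # Step 2: prompt with the transcript *and* the speech embeddings for AST.
    prompt = f"translate to {tgt_lang}: {transcript}"
    return model.generate(prompt, speech_emb)
```

The key structural point is that step 2 conditions on both the intermediate ASR transcript and the original speech embeddings, rather than on the transcript alone.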
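The LoRA adaptation mentioned in the abstract replaces full fine-tuning with a trainable low-rank update to frozen weights. Below is an illustrative sketch of that idea on a single weight matrix, assuming the standard LoRA formulation W + (alpha/r)·A·B; it is not the Megatron-T5 integration used in the paper, and all variable names are illustrative.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Compute x @ (W + (alpha/r) * A @ B); only A and B would be trained."""
    r = A.shape[1]                    # low rank r << min(W.shape)
    delta = (alpha / r) * (A @ B)     # low-rank update, same shape as W
    return x @ (W + delta)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
W = rng.standard_normal((d_in, d_out))     # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01  # trainable down-projection
B = np.zeros((r, d_out))                   # trainable up-projection, zero at init
x = rng.standard_normal((1, d_in))
# With B initialized to zero, LoRA leaves the pretrained mapping unchanged.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

Because only A and B (2·d·r parameters per matrix) are updated, this adapts far fewer parameters than full fine-tuning, which is the trade-off the abstract reports as actually improving performance here.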