Chain-of-Thought Prompting for Speech Translation
Ke Hu, Zhehuai Chen, Chao-Han Huck Yang, Piotr Żelasko, Oleksii Hrinchuk, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg
arXiv - CS - Computation and Language, 2024-09-17. DOI: arxiv-2409.11538 (https://doi.org/arxiv-2409.11538)

Large language models (LLMs) have demonstrated remarkable advancements in
language understanding and generation. Building on the success of text-based
LLMs, recent research has adapted these models to use speech embeddings for
prompting, resulting in Speech-LLMs that exhibit strong performance in
automatic speech recognition (ASR) and automatic speech translation (AST). In
this work, we propose a novel approach to leverage ASR transcripts as prompts
for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM
consists of a speech encoder and Megatron-T5, an encoder-decoder LLM. By first
decoding speech to generate ASR transcripts and
subsequently using these transcripts along with encoded speech for prompting,
we guide speech translation through a two-step process similar to chain-of-thought
(CoT) prompting. We adapt the T5 LLM with low-rank adaptation (LoRA), which
outperforms full-model fine-tuning.
Experimental results show that the proposed CoT prompting significantly
improves AST performance, achieving an average increase of 2.4 BLEU points
across 6 En->X or X->En AST tasks compared to speech prompting alone.
Additionally, compared to a related CoT prediction method that predicts a
concatenated sequence of ASR and AST transcripts, our method performs better by
an average of 2 BLEU points.
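
The two-step flow described above can be illustrated with a small sketch. It is not the authors' implementation: the Prompt container, the decode callable, and the toy decoder are hypothetical stand-ins for the speech encoder and Megatron-T5, introduced only to show the order of operations. Step 1 decodes an ASR transcript from the encoded speech; step 2 prompts for the translation with both that transcript and the same encoded speech. The speech-only baseline is included for contrast.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representation: encoded speech as a list of float vectors.
SpeechEmbedding = List[List[float]]


@dataclass
class Prompt:
    """Inputs for one decoding pass (hypothetical container)."""
    task: str                 # "asr" or "ast"
    speech: SpeechEmbedding   # encoded speech, present in both steps
    text: str = ""            # ASR transcript; empty for speech-only prompting
    target_lang: str = ""     # only used for "ast"


# Hypothetical model interface: maps a prompt to decoded text.
Decoder = Callable[[Prompt], str]


def speech_only_translate(speech: SpeechEmbedding, target_lang: str,
                          decode: Decoder) -> str:
    """Baseline: translate from encoded speech alone."""
    return decode(Prompt(task="ast", speech=speech, target_lang=target_lang))


def cot_translate(speech: SpeechEmbedding, target_lang: str,
                  decode: Decoder) -> Tuple[str, str]:
    """Two-step prompting: ASR first, then AST with transcript plus speech."""
    # Step 1: decode the speech into an ASR transcript.
    transcript = decode(Prompt(task="asr", speech=speech))
    # Step 2: prompt with the transcript AND the same encoded speech.
    translation = decode(Prompt(task="ast", speech=speech,
                                text=transcript, target_lang=target_lang))
    return transcript, translation


def make_toy_decoder(log: List[Prompt]) -> Decoder:
    """Toy stand-in for the model; records every prompt it receives."""
    def decode(prompt: Prompt) -> str:
        log.append(prompt)
        if prompt.task == "asr":
            return "hello world"
        tag = f"[{prompt.target_lang}]"
        return f"{tag} {prompt.text}" if prompt.text else f"{tag} <speech only>"
    return decode


calls: List[Prompt] = []
decode = make_toy_decoder(calls)
transcript, translation = cot_translate([[0.1, 0.2]], "de", decode)
print(transcript)
print(translation)
```

In this sketch the second call receives the transcript produced by the first, which is what distinguishes the approach from predicting a single concatenated ASR and AST sequence in one pass.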