Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang
{"title":"利用大语言模型为中文语音识别进行全文纠错","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\nerror correction in Automatic Speech Recognition (ASR). However, most research\nfocuses on utterances from short-duration speech recordings, which are the\npredominant form of speech data for supervised ASR training. This paper\ninvestigates the effectiveness of LLMs for error correction in full-text\ngenerated by ASR systems from longer speech recordings, such as transcripts\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\nThis dataset enables us to correct errors across contexts, including both\nfull-text and segment, and to address a broader range of error types, such as\npunctuation restoration and inverse text normalization, thus making the\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\nconstructed dataset using a diverse set of prompts and target formats, and\nevaluate its performance on full-text error correction. Specifically, we design\nprompts based on full-text and segment, considering various output formats,\nsuch as directly corrected text and JSON-based error-correction pairs. Through\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\nwe find that the fine-tuned LLMs perform well in the full-text setting with\ndifferent prompts, each presenting its own strengths and weaknesses. This\nestablishes a promising baseline for further research. The dataset is available\non the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Full-text Error Correction for Chinese Speech Recognition with Large Language Model\",\"authors\":\"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang\",\"doi\":\"arxiv-2409.07790\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) have demonstrated substantial potential for\\nerror correction in Automatic Speech Recognition (ASR). However, most research\\nfocuses on utterances from short-duration speech recordings, which are the\\npredominant form of speech data for supervised ASR training. This paper\\ninvestigates the effectiveness of LLMs for error correction in full-text\\ngenerated by ASR systems from longer speech recordings, such as transcripts\\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\\nThis dataset enables us to correct errors across contexts, including both\\nfull-text and segment, and to address a broader range of error types, such as\\npunctuation restoration and inverse text normalization, thus making the\\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\\nconstructed dataset using a diverse set of prompts and target formats, and\\nevaluate its performance on full-text error correction. 
Specifically, we design\\nprompts based on full-text and segment, considering various output formats,\\nsuch as directly corrected text and JSON-based error-correction pairs. Through\\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\\nwe find that the fine-tuned LLMs perform well in the full-text setting with\\ndifferent prompts, each presenting its own strengths and weaknesses. This\\nestablishes a promising baseline for further research. The dataset is available\\non the website.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07790\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in the full text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, at both the full-text and segment level, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts for both full-text and segment inputs, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Across various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
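The abstract does not detail the error-correction pair extractor, but the underlying idea, aligning an ASR hypothesis against a reference transcript and harvesting the mismatched spans, can be illustrated with a short sketch. The following is a minimal, hypothetical Python version using character-level alignment (a natural fit for Chinese); the function name and the use of difflib are assumptions, not the paper's implementation.

```python
import difflib

def extract_pairs(hypothesis: str, reference: str) -> list[tuple[str, str]]:
    """Emit (error, correction) pairs via character-level alignment.

    Hypothetical sketch: the paper's actual extractor is not described
    in the abstract. Character-level opcodes work reasonably well for
    Chinese, where typical ASR errors are homophone substitutions or
    missing punctuation.
    """
    matcher = difflib.SequenceMatcher(a=hypothesis, b=reference)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete", "insert" all yield a pair
            pairs.append((hypothesis[i1:i2], reference[j1:j2]))
    return pairs

# A missing comma and full stop, as in the punctuation-restoration case:
print(extract_pairs("今天天气很好我们去公园", "今天天气很好，我们去公园。"))
# -> [('', '，'), ('', '。')]
```

In a full pipeline of the kind the abstract describes, the hypothesis would come from running ASR over synthesized speech and the reference from the original text, so the extracted pairs double as supervised targets for fine-tuning.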
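The abstract also contrasts two target formats: directly corrected text and JSON-based error-correction pairs. The exact ChFT schema is not given, so the field names below are assumptions; the example only illustrates how such a structured target might look.

```python
import json

# Hypothetical target for the "JSON-based error-correction pairs"
# format; the field names are assumed, not taken from the paper.
corrections = [
    {"error": "在坐的各位", "correction": "在座的各位"},  # homophone fix
    {"error": "二零二四年", "correction": "2024年"},      # inverse text normalization
]
print(json.dumps(corrections, ensure_ascii=False))
```

A structured format like this makes the model's edits auditable and easy to apply programmatically, whereas emitting directly corrected text keeps the prompt and decoding simpler; the abstract reports that each choice has its own strengths and weaknesses.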