{"title":"Full-text Error Correction for Chinese Speech Recognition with Large Language Model","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\nerror correction in Automatic Speech Recognition (ASR). However, most research\nfocuses on utterances from short-duration speech recordings, which are the\npredominant form of speech data for supervised ASR training. This paper\ninvestigates the effectiveness of LLMs for error correction in full-text\ngenerated by ASR systems from longer speech recordings, such as transcripts\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\nThis dataset enables us to correct errors across contexts, including both\nfull-text and segment, and to address a broader range of error types, such as\npunctuation restoration and inverse text normalization, thus making the\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\nconstructed dataset using a diverse set of prompts and target formats, and\nevaluate its performance on full-text error correction. Specifically, we design\nprompts based on full-text and segment, considering various output formats,\nsuch as directly corrected text and JSON-based error-correction pairs. Through\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\nwe find that the fine-tuned LLMs perform well in the full-text setting with\ndifferent prompts, each presenting its own strengths and weaknesses. This\nestablishes a promising baseline for further research. The dataset is available\non the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in the full text generated by ASR systems from longer speech recordings, such as transcripts of podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor.
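As a rough illustration of the pair-extraction step in this pipeline, the sketch below (hypothetical and minimal, not the authors' implementation; the TTS and ASR stages are assumed to run upstream) aligns an ASR hypothesis against its reference text with Python's standard difflib and collects the mismatching spans as error-correction pairs:

    from difflib import SequenceMatcher
    from typing import List, Tuple

    def extract_error_pairs(hypothesis: str, reference: str) -> List[Tuple[str, str]]:
        # Align the ASR hypothesis with the reference text character by character
        # and keep every non-matching span as an (error, correction) pair.
        matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
        pairs = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":  # 'replace', 'delete' and 'insert' all mark errors
                pairs.append((hypothesis[i1:i2], reference[j1:j2]))
        return pairs

    # A substitution error plus missing end punctuation (punctuation restoration).
    print(extract_error_pairs("我们在会意上讨论了这个问题", "我们在会议上讨论了这个问题。"))
    # -> [('意', '议'), ('', '。')]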
This dataset enables us to correct errors in both full-text and segment contexts and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment contexts, considering various output formats, such as directly corrected text and JSON-based error-correction pairs.
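For concreteness, one plausible rendering of these two target formats is sketched below; the prompt wording and JSON field names are assumptions, as the abstract does not specify the exact schema used in ChFT:

    import json

    asr_full_text = "我们在会意上讨论了这个问题"

    # A full-text style correction prompt (wording is illustrative only).
    prompt = "请修正下面语音识别转写中的错误，输出JSON格式的错误-更正对：\n" + asr_full_text

    # Output format 1: the directly corrected text.
    corrected_text = "我们在会议上讨论了这个问题。"

    # Output format 2: JSON-based error-correction pairs that a post-processing
    # script could apply back to the ASR transcript (hypothetical field names).
    error_pairs = [
        {"error": "会意", "correction": "会议"},
        {"error": "问题", "correction": "问题。"},  # punctuation restoration
    ]
    print(json.dumps(error_pairs, ensure_ascii=False, indent=2))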
Across various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available online.