Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang
{"title":"Full-text Error Correction for Chinese Speech Recognition with Large Language Model","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\nerror correction in Automatic Speech Recognition (ASR). However, most research\nfocuses on utterances from short-duration speech recordings, which are the\npredominant form of speech data for supervised ASR training. This paper\ninvestigates the effectiveness of LLMs for error correction in full-text\ngenerated by ASR systems from longer speech recordings, such as transcripts\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\nThis dataset enables us to correct errors across contexts, including both\nfull-text and segment, and to address a broader range of error types, such as\npunctuation restoration and inverse text normalization, thus making the\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\nconstructed dataset using a diverse set of prompts and target formats, and\nevaluate its performance on full-text error correction. Specifically, we design\nprompts based on full-text and segment, considering various output formats,\nsuch as directly corrected text and JSON-based error-correction pairs. Through\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\nwe find that the fine-tuned LLMs perform well in the full-text setting with\ndifferent prompts, each presenting its own strengths and weaknesses. This\nestablishes a promising baseline for further research. The dataset is available\non the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, covering both the full text and individual segments, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset with a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on the full text and on segments, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Across various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting under different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available online.
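The abstract mentions an error-correction pair extractor in the dataset pipeline and a JSON-based target format of error-correction pairs, without detailing either. As a rough illustration only, the sketch below aligns an ASR hypothesis with its reference using Python's difflib and emits pairs in a plausible JSON shape; the character-level alignment, function name, and field names are assumptions, not the authors' implementation.

```python
import difflib
import json

def extract_pairs(hypothesis: str, reference: str) -> list[dict]:
    """Align an ASR hypothesis with its reference and collect
    error-correction pairs. Hypothetical sketch: the paper does not
    specify its extractor, so the difflib-based character alignment
    and the JSON field names here are assumptions."""
    pairs = []
    matcher = difflib.SequenceMatcher(None, hypothesis, reference)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', and 'insert' spans
            pairs.append({
                "error": hypothesis[i1:i2],      # "" for pure insertions
                "correction": reference[j1:j2],  # "" for pure deletions
            })
    return pairs

# Toy example: missing punctuation, one of the error types in the abstract
hyp = "今天天气很好我们去公园散步"
ref = "今天天气很好,我们去公园散步。"
print(json.dumps(extract_pairs(hyp, ref), ensure_ascii=False))
# [{"error": "", "correction": ","}, {"error": "", "correction": "。"}]
```

Under this scheme, insertions and deletions surface as pairs with one empty side, which is how punctuation restoration would be represented; a real extractor would presumably also merge adjacent edits and operate over both full-text and segment contexts.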