{"title":"Full-text Error Correction for Chinese Speech Recognition with Large Language Model","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\nerror correction in Automatic Speech Recognition (ASR). However, most research\nfocuses on utterances from short-duration speech recordings, which are the\npredominant form of speech data for supervised ASR training. This paper\ninvestigates the effectiveness of LLMs for error correction in full-text\ngenerated by ASR systems from longer speech recordings, such as transcripts\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\nThis dataset enables us to correct errors across contexts, including both\nfull-text and segment, and to address a broader range of error types, such as\npunctuation restoration and inverse text normalization, thus making the\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\nconstructed dataset using a diverse set of prompts and target formats, and\nevaluate its performance on full-text error correction. Specifically, we design\nprompts based on full-text and segment, considering various output formats,\nsuch as directly corrected text and JSON-based error-correction pairs. Through\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\nwe find that the fine-tuned LLMs perform well in the full-text setting with\ndifferent prompts, each presenting its own strengths and weaknesses. This\nestablishes a promising baseline for further research. The dataset is available\non the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in the full text generated by ASR systems from longer speech recordings, such as transcripts of podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor.
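As a rough illustration of the pair-extraction step in this pipeline, the sketch below (hypothetical and minimal, not the authors' implementation; the TTS and ASR stages are assumed to run upstream) aligns an ASR hypothesis against its reference text with Python's standard difflib and collects the mismatching spans as error-correction pairs:

    from difflib import SequenceMatcher
    from typing import List, Tuple

    def extract_error_pairs(hypothesis: str, reference: str) -> List[Tuple[str, str]]:
        # Align the ASR hypothesis with the reference text character by character
        # and keep every non-matching span as an (error, correction) pair.
        matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
        pairs = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":  # 'replace', 'delete' and 'insert' all mark errors
                pairs.append((hypothesis[i1:i2], reference[j1:j2]))
        return pairs

    # A substitution error plus missing end punctuation (punctuation restoration).
    print(extract_error_pairs("我们在会意上讨论了这个问题", "我们在会议上讨论了这个问题。"))
    # -> [('意', '议'), ('', '。')]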
This dataset enables us to correct errors in both full-text and segment contexts and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment contexts, considering various output formats, such as directly corrected text and JSON-based error-correction pairs.
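For concreteness, one plausible rendering of these two target formats is sketched below; the prompt wording and JSON field names are assumptions, as the abstract does not specify the exact schema used in ChFT:

    import json

    asr_full_text = "我们在会意上讨论了这个问题"

    # A full-text style correction prompt (wording is illustrative only).
    prompt = "请修正下面语音识别转写中的错误，输出JSON格式的错误-更正对：\n" + asr_full_text

    # Output format 1: the directly corrected text.
    corrected_text = "我们在会议上讨论了这个问题。"

    # Output format 2: JSON-based error-correction pairs that a post-processing
    # script could apply back to the ASR transcript (hypothetical field names).
    error_pairs = [
        {"error": "会意", "correction": "会议"},
        {"error": "问题", "correction": "问题。"},  # punctuation restoration
    ]
    print(json.dumps(error_pairs, ensure_ascii=False, indent=2))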
Across various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available online.