{"title":"利用大语言模型为中文语音识别进行全文纠错","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\nerror correction in Automatic Speech Recognition (ASR). However, most research\nfocuses on utterances from short-duration speech recordings, which are the\npredominant form of speech data for supervised ASR training. This paper\ninvestigates the effectiveness of LLMs for error correction in full-text\ngenerated by ASR systems from longer speech recordings, such as transcripts\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\nThis dataset enables us to correct errors across contexts, including both\nfull-text and segment, and to address a broader range of error types, such as\npunctuation restoration and inverse text normalization, thus making the\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\nconstructed dataset using a diverse set of prompts and target formats, and\nevaluate its performance on full-text error correction. Specifically, we design\nprompts based on full-text and segment, considering various output formats,\nsuch as directly corrected text and JSON-based error-correction pairs. Through\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\nwe find that the fine-tuned LLMs perform well in the full-text setting with\ndifferent prompts, each presenting its own strengths and weaknesses. This\nestablishes a promising baseline for further research. The dataset is available\non the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Full-text Error Correction for Chinese Speech Recognition with Large Language Model\",\"authors\":\"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang\",\"doi\":\"arxiv-2409.07790\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) have demonstrated substantial potential for\\nerror correction in Automatic Speech Recognition (ASR). However, most research\\nfocuses on utterances from short-duration speech recordings, which are the\\npredominant form of speech data for supervised ASR training. This paper\\ninvestigates the effectiveness of LLMs for error correction in full-text\\ngenerated by ASR systems from longer speech recordings, such as transcripts\\nfrom podcasts, news broadcasts, and meetings. First, we develop a Chinese\\ndataset for full-text error correction, named ChFT, utilizing a pipeline that\\ninvolves text-to-speech synthesis, ASR, and error-correction pair extractor.\\nThis dataset enables us to correct errors across contexts, including both\\nfull-text and segment, and to address a broader range of error types, such as\\npunctuation restoration and inverse text normalization, thus making the\\ncorrection process comprehensive. Second, we fine-tune a pre-trained LLM on the\\nconstructed dataset using a diverse set of prompts and target formats, and\\nevaluate its performance on full-text error correction. 
Specifically, we design\\nprompts based on full-text and segment, considering various output formats,\\nsuch as directly corrected text and JSON-based error-correction pairs. Through\\nvarious test settings, including homogeneous, up-to-date, and hard test sets,\\nwe find that the fine-tuned LLMs perform well in the full-text setting with\\ndifferent prompts, each presenting its own strengths and weaknesses. This\\nestablishes a promising baseline for further research. The dataset is available\\non the website.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"12 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07790\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Full-text Error Correction for Chinese Speech Recognition with Large Language Model
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in the full text generated by ASR systems from longer speech recordings, such as transcripts of podcasts, news broadcasts, and meetings.

First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors in both full-text and segment-level contexts, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive.
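To make the error-correction pair extractor step concrete, here is a minimal Python sketch of one plausible realization: the ASR hypothesis is aligned against the reference text at the character level (which suits Chinese), and every differing span becomes an error-correction pair. The abstract does not describe the actual extractor, so the alignment method, function name, and pair schema below are assumptions for illustration only.

import difflib
from typing import Dict, List

def extract_pairs(hypothesis: str, reference: str) -> List[Dict[str, str]]:
    """Align an ASR hypothesis against the reference text character by
    character and emit each non-matching span as an error-correction pair."""
    pairs = []
    matcher = difflib.SequenceMatcher(None, hypothesis, reference)
    for tag, h1, h2, r1, r2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        pairs.append({
            "error": hypothesis[h1:h2],      # text produced by the ASR system
            "correction": reference[r1:r2],  # text as it should read
        })
    return pairs

# Toy usage: the hypothesis lacks the punctuation present in the reference.
hyp = "今天天气真好我们去公园散步"
ref = "今天天气真好，我们去公园散步。"
print(extract_pairs(hyp, ref))
# [{'error': '', 'correction': '，'}, {'error': '', 'correction': '。'}]

Pairs with an empty "error" field correspond to pure insertions, such as the punctuation-restoration cases mentioned above.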
Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on either the full text or individual segments, and consider various output formats, such as directly corrected text and JSON-based error-correction pairs.
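As a rough illustration of these two target formats, the sketch below contrasts a directly corrected transcript with a JSON list of error-correction pairs and shows how such pairs can be mapped back onto the raw transcript. The field names and the example sentence are assumptions; the abstract does not specify the exact schema used in the paper.

import json

raw_transcript = "会议将于十月一号上午九点开始"  # raw ASR output: no punctuation, no ITN

# (a) Direct format: the model returns the fully repaired transcript.
direct_target = "会议将于10月1日上午9点开始。"

# (b) JSON format: the model returns only the spans that change.
json_target = json.dumps(
    [
        {"error": "十月一号", "correction": "10月1日"},  # inverse text normalization
        {"error": "九点", "correction": "9点"},
        {"error": "开始", "correction": "开始。"},        # punctuation restoration
    ],
    ensure_ascii=False,
)

def apply_pairs(transcript: str, reply: str) -> str:
    """Rebuild corrected text from JSON pairs via naive first-match replacement."""
    for pair in json.loads(reply):
        transcript = transcript.replace(pair["error"], pair["correction"], 1)
    return transcript

assert apply_pairs(raw_transcript, json_target) == direct_target

The JSON format keeps the model's output short and easy to verify, while the direct format avoids the re-application step.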
Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.