Evaluation of real-time transcriptions using end-to-end ASR models

Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
{"title":"使用端到端 ASR 模型对实时转录进行评估","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":null,"url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\nevolved in the last few years. Traditional architectures based on pipelines\nhave been replaced by joint end-to-end (E2E) architectures that simplify and\nstreamline the model training process. In addition, new AI training methods,\nsuch as weak-supervised learning have reduced the need for high-quality audio\ndatasets for model training. However, despite all these advancements, little to\nno research has been done on real-time transcription. In real-time scenarios,\nthe audio is not pre-recorded, and the input audio must be fragmented to be\nprocessed by the ASR systems. To achieve real-time requirements, these\nfragments must be as short as possible to reduce latency. However, audio cannot\nbe split at any point as dividing an utterance into two separate fragments will\ngenerate an incorrect transcription. Also, shorter fragments provide less\ncontext for the ASR model. For this reason, it is necessary to design and test\ndifferent splitting algorithms to optimize the quality and delay of the\nresulting transcription. In this paper, three audio splitting algorithms are\nevaluated with different ASR models to determine their impact on both the\nquality of the transcription and the end-to-end delay. The algorithms are\nfragmentation at fixed intervals, voice activity detection (VAD), and\nfragmentation with feedback. The results are compared to the performance of the\nsame model, without audio fragmentation, to determine the effects of this\ndivision. The results show that VAD fragmentation provides the best quality\nwith the highest delay, whereas fragmentation at fixed intervals provides the\nlowest quality and the lowest delay. The newly proposed feedback algorithm\nexchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\nto the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"37 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluation of real-time transcriptions using end-to-end ASR models\",\"authors\":\"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso\",\"doi\":\"arxiv-2409.05674\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\\nevolved in the last few years. Traditional architectures based on pipelines\\nhave been replaced by joint end-to-end (E2E) architectures that simplify and\\nstreamline the model training process. In addition, new AI training methods,\\nsuch as weak-supervised learning have reduced the need for high-quality audio\\ndatasets for model training. However, despite all these advancements, little to\\nno research has been done on real-time transcription. In real-time scenarios,\\nthe audio is not pre-recorded, and the input audio must be fragmented to be\\nprocessed by the ASR systems. To achieve real-time requirements, these\\nfragments must be as short as possible to reduce latency. However, audio cannot\\nbe split at any point as dividing an utterance into two separate fragments will\\ngenerate an incorrect transcription. Also, shorter fragments provide less\\ncontext for the ASR model. 
For this reason, it is necessary to design and test\\ndifferent splitting algorithms to optimize the quality and delay of the\\nresulting transcription. In this paper, three audio splitting algorithms are\\nevaluated with different ASR models to determine their impact on both the\\nquality of the transcription and the end-to-end delay. The algorithms are\\nfragmentation at fixed intervals, voice activity detection (VAD), and\\nfragmentation with feedback. The results are compared to the performance of the\\nsame model, without audio fragmentation, to determine the effects of this\\ndivision. The results show that VAD fragmentation provides the best quality\\nwith the highest delay, whereas fragmentation at fixed intervals provides the\\nlowest quality and the lowest delay. The newly proposed feedback algorithm\\nexchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\\nto the VAD splitting.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":\"37 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05674\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05674","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
自动语音识别(ASR)或语音到文本(STT)技术在过去几年中得到了长足的发展。基于流水线的传统架构已被端到端 (E2E) 联合架构所取代,后者简化并简化了模型训练过程。此外,新的人工智能训练方法(如弱监督学习)降低了模型训练对高质量音频数据集的需求。然而,尽管取得了所有这些进步,有关实时转录的研究却少之又少。在实时场景中,音频不是预先录制的,输入的音频必须经过分片才能被 ASR 系统处理。为了达到实时要求,这些片段必须尽可能短,以减少延迟。但是,音频不能在任何时候被分割,因为将一个语句分割成两个独立的片段会产生错误的转录。此外,较短的片段为 ASR 模型提供的语境较少。因此,有必要设计和测试不同的分割算法,以优化转录结果的质量和延迟。本文使用不同的 ASR 模型对三种音频分割算法进行了评估,以确定它们对转录质量和端到端延迟的影响。这三种算法分别是固定间隔分割、语音活动检测(VAD)和带反馈的分割。将结果与不进行音频分片的相同模型的性能进行比较,以确定这种分片的效果。结果表明,VAD 分片提供了最好的质量和最高的延迟,而固定间隔分片则提供了最差的质量和最低的延迟。新提出的反馈算法分别以 WER 增加 2-4% 换取 VAD 分割延迟减少 1.5-2s 的效果。

Abstract

Automatic Speech Recognition (ASR), or speech-to-text (STT), has evolved greatly in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new AI training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets for model training. However, despite all these advancements, little to no research has been done on real-time transcription. In real-time scenarios the audio is not pre-recorded, and the input audio must be fragmented before the ASR system can process it. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, audio cannot be split at arbitrary points: dividing an utterance into two separate fragments produces an incorrect transcription, and shorter fragments give the ASR model less context. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and the delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both transcription quality and end-to-end delay: fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared to the performance of the same model without audio fragmentation to determine the effects of this division. They show that VAD fragmentation provides the best quality at the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in word error rate (WER) for a reduction in delay of 1.5-2 s with respect to VAD splitting.
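
To make the two baseline splitting strategies concrete, below is a minimal Python sketch; it is an illustration, not the paper's implementation. It assumes 16 kHz mono float audio, the energy-threshold voice activity detector is a simple stand-in for the trained VADs real systems typically use, and all thresholds are arbitrary.

import numpy as np

SAMPLE_RATE = 16_000  # Hz; a common input rate for E2E ASR models (assumption)


def split_fixed_intervals(audio: np.ndarray, chunk_s: float = 2.0) -> list:
    """Cut the signal every chunk_s seconds, ignoring utterance boundaries."""
    step = int(chunk_s * SAMPLE_RATE)
    return [audio[i:i + step] for i in range(0, len(audio), step)]


def split_on_silence(audio: np.ndarray, frame_s: float = 0.03,
                     energy_thresh: float = 1e-4,
                     min_silence_frames: int = 10) -> list:
    """Close a fragment after min_silence_frames consecutive low-energy
    frames, so cuts land in pauses rather than inside words."""
    frame = int(frame_s * SAMPLE_RATE)
    fragments, current, silent_run = [], [], 0
    for i in range(0, len(audio), frame):
        f = audio[i:i + frame]
        current.append(f)
        silent_run = silent_run + 1 if np.mean(f ** 2) < energy_thresh else 0
        # Only cut once we have seen speech followed by a long enough pause.
        if silent_run >= min_silence_frames and len(current) > silent_run:
            fragments.append(np.concatenate(current))
            current, silent_run = [], 0
    if current:
        fragments.append(np.concatenate(current))
    return fragments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = 0.1 * rng.standard_normal(2 * SAMPLE_RATE)  # 2 s stand-in for speech
    pause = np.zeros(SAMPLE_RATE // 2)                   # 0.5 s of silence
    audio = np.concatenate([speech, pause, speech])
    print(len(split_fixed_intervals(audio)), "fixed-interval fragments")
    print(len(split_on_silence(audio)), "pause-aligned fragments")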
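
The abstract does not describe how the proposed feedback mechanism works, so the sketch below only illustrates the general shape of a quality-for-latency trade: prefer cutting at a detected pause, but force a cut once a fragment reaches a hard budget. The max_fragment_s parameter and the forcing rule are assumptions for illustration and should not be read as the paper's algorithm.

import numpy as np

SAMPLE_RATE = 16_000


def split_with_latency_budget(audio: np.ndarray, frame_s: float = 0.03,
                              energy_thresh: float = 1e-4,
                              min_silence_frames: int = 10,
                              max_fragment_s: float = 5.0) -> list:
    """Hypothetical quality/latency trade-off: cut at pauses when possible,
    but never let a fragment grow past max_fragment_s (NOT the paper's
    feedback algorithm, whose details the abstract does not give)."""
    frame = int(frame_s * SAMPLE_RATE)
    max_frames = int(max_fragment_s / frame_s)
    fragments, current, silent_run = [], [], 0
    for i in range(0, len(audio), frame):
        current.append(audio[i:i + frame])
        silent_run = (silent_run + 1
                      if np.mean(current[-1] ** 2) < energy_thresh else 0)
        at_pause = silent_run >= min_silence_frames and len(current) > silent_run
        if at_pause or len(current) >= max_frames:  # budget exhausted: force cut
            fragments.append(np.concatenate(current))
            current, silent_run = [], 0
    if current:
        fragments.append(np.concatenate(current))
    return fragments

Under a scheme like this, lowering max_fragment_s bounds the worst-case delay at the price of more mid-utterance cuts, which is the direction of the WER-for-delay trade reported above.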
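
For reference, the quality metric quoted above is the word error rate (WER): the word-level edit distance between reference and hypothesis, normalized by the reference length. A minimal self-contained implementation:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6, about 0.167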