ASR Benchmarking: Need for a More Representative Conversational Dataset

arXiv - EE - Audio and Speech Processing | Pub Date: 2024-09-18 | DOI: arxiv-2409.12042

Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad

Abstract

Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured, spoken in diverse accents, and marked by disfluencies such as pauses and interruptions. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate (WER) and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
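For readers unfamiliar with the metric, Word Error Rate is commonly defined as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The following is a minimal sketch of that standard definition, not the authors' scoring pipeline: the function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration, and the abstract does not state which text normalization the study used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are obtained by lowercasing and splitting on
    whitespace; real evaluations usually apply heavier normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# A disfluent reference: the filler "uh" and the repeated "i" are dropped
# by the hypothetical recognizer, which counts as two deletions.
disfluent_wer = word_error_rate("i uh i think we should go", "i think we should go")
clean_wer = word_error_rate("i think we should go", "i think we should go")
```

In this toy case the disfluent utterance scores 2/7 while the fluent one scores 0, which illustrates one way verbatim transcripts of disfluent speech can raise WER when a model omits fillers and repetitions.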
