ASR Benchmarking: Need for a More Representative Conversational Dataset
Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad
arXiv - EE - Audio and Speech Processing, 2024-09-18. arXiv:2409.12042
Abstract
Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured, contains disfluencies such as pauses and interruptions, and spans diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
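The metric underlying these results is Word Error Rate (WER), the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. The sketch below is illustrative only, not the authors' evaluation pipeline; the example transcripts are invented to show how disfluencies such as fillers and restarts inflate the score.

```python
# Minimal word-level WER: (substitutions + deletions + insertions) / N,
# where N is the number of words in the reference. Computed via a
# standard Levenshtein edit-distance DP over word tokens.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: a disfluent reference ("um", a repeated "I")
# whose fillers the model drops, counting as deletions against it.
print(wer("so um I I was going to call you",
          "so I was going to call you"))  # 2 deletions / 9 words ~= 0.22
```

On clean read speech the reference contains few such fillers, so the same model output scores far better; this asymmetry is one mechanism behind the WER-disfluency correlation the paper reports.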