{"title":"An item response theory framework to evaluate automatic speech recognition systems against speech difficulty","authors":"Chaina Santos Oliveira, Ricardo B.C. Prudêncio","doi":"10.1016/j.csl.2025.101817","DOIUrl":null,"url":null,"abstract":"<div><div>Evaluating the performance of Automatic Speech Recognition (ASR) systems is very relevant for selecting good techniques and understanding their advantages and limitations. ASR systems are usually evaluated by adopting test sets of audio speeches, ideally with different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches, while being robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) to measure the difficulty of each test speech; and (2) to analyse each ASR system’s performance against the difficulty level. Regarding the first issue, the framework measures speech difficulty by adopting Item Response Theory (IRT). Regarding the second issue, the Recognizer Characteristic Curve (RCC) is proposed, which is a plot of the ASR system’s performance versus speech difficulty. ASR performance is further analysed by a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied in a test set produced by adopting text-to-speech tools, with diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. In the experiments, noise injection actually increases difficulty and generates a wide variety of speeches to assess ASR performance. However, it is essential to pay attention that high noise levels can lead to an unreliable evaluation. The proposed plots were helpful for both identifying robust ASR systems as well as for choosing the noise level that results in both diversity and reliability.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101817"},"PeriodicalIF":3.1000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000427","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Evaluating the performance of Automatic Speech Recognition (ASR) systems is essential for selecting suitable techniques and understanding their strengths and limitations. ASR systems are usually evaluated on test sets of speech recordings, ideally spanning different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches while remaining robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) measuring the difficulty of each test speech; and (2) analysing each ASR system’s performance against that difficulty level. For the first issue, the framework measures speech difficulty using Item Response Theory (IRT). For the second, the Recognizer Characteristic Curve (RCC) is proposed: a plot of the ASR system’s performance versus speech difficulty. ASR performance is further analysed in a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied to a test set produced with text-to-speech tools, covering diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. The results show that noise injection does increase difficulty and generates a wide variety of speeches for assessing ASR performance. However, high noise levels can lead to an unreliable evaluation. The proposed plots proved helpful both for identifying robust ASR systems and for choosing a noise level that balances diversity and reliability.
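The abstract does not specify which IRT parameterization the authors use, nor how the paper decomposes difficulty into sentence difficulty and speaker quality. As a rough illustration of the general idea, the sketch below fits a one-parameter (Rasch-style) IRT model to a binary response matrix (rows = ASR systems, columns = test speeches) and traces each system's Recognizer Characteristic Curve as predicted accuracy versus speech difficulty. The toy data, variable names, and fitting loop are assumptions for illustration, not the authors' implementation.

```python
# Minimal Rasch-style IRT sketch: estimate per-speech difficulty and
# per-system ability from binary "correct transcription" outcomes,
# then trace each system's RCC (predicted accuracy vs. difficulty).
import numpy as np

rng = np.random.default_rng(0)

# Toy responses: 1 if the system transcribed the speech correctly.
n_systems, n_items = 4, 60
true_ability = np.array([-1.0, 0.0, 0.5, 1.5])
true_difficulty = rng.normal(0.0, 1.0, n_items)
p_true = 1.0 / (1.0 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
responses = rng.binomial(1, p_true)

# Joint maximum-likelihood fit by gradient ascent on the Rasch
# log-likelihood; difficulties are centred to fix the scale.
ability = np.zeros(n_systems)
difficulty = np.zeros(n_items)
lr = 0.05
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    resid = responses - p                      # gradient of the log-likelihood
    ability += lr * resid.sum(axis=1) / n_items
    difficulty -= lr * resid.sum(axis=0) / n_systems
    difficulty -= difficulty.mean()            # identifiability constraint

# RCC: predicted probability of a correct transcription as a function
# of speech difficulty, one curve per system.
grid = np.linspace(-3, 3, 7)
for i in range(n_systems):
    rcc = 1.0 / (1.0 + np.exp(-(ability[i] - grid)))
    print(f"system {i} (ability {ability[i]:+.2f}):",
          " ".join(f"{q:.2f}" for q in rcc))
```

Under this model, a more robust system's curve stays high further into the difficult region of the scale; the paper's two-dimensional plot additionally splits the difficulty axis into sentence and speaker components, which this one-dimensional sketch does not attempt.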
Journal Introduction
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.