{"title":"An item response theory framework to evaluate automatic speech recognition systems against speech difficulty","authors":"Chaina Santos Oliveira, Ricardo B.C. Prudêncio","doi":"10.1016/j.csl.2025.101817","DOIUrl":null,"url":null,"abstract":"<div><div>Evaluating the performance of Automatic Speech Recognition (ASR) systems is very relevant for selecting good techniques and understanding their advantages and limitations. ASR systems are usually evaluated by adopting test sets of audio speeches, ideally with different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches, while being robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) to measure the difficulty of each test speech; and (2) to analyse each ASR system’s performance against the difficulty level. Regarding the first issue, the framework measures speech difficulty by adopting Item Response Theory (IRT). Regarding the second issue, the Recognizer Characteristic Curve (RCC) is proposed, which is a plot of the ASR system’s performance versus speech difficulty. ASR performance is further analysed by a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied in a test set produced by adopting text-to-speech tools, with diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. In the experiments, noise injection actually increases difficulty and generates a wide variety of speeches to assess ASR performance. However, it is essential to pay attention that high noise levels can lead to an unreliable evaluation. The proposed plots were helpful for both identifying robust ASR systems as well as for choosing the noise level that results in both diversity and reliability.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101817"},"PeriodicalIF":3.1000,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000427","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Evaluating the performance of Automatic Speech Recognition (ASR) systems is essential for selecting suitable techniques and understanding their strengths and limitations. ASR systems are usually evaluated on test sets of speech recordings, ideally spanning different difficulty levels. In this sense, it is important to analyse whether a system under test correctly transcribes easy test speeches while remaining robust to the most difficult ones. In this paper, a novel framework is proposed for evaluating ASR systems, which covers two complementary issues: (1) measuring the difficulty of each test speech; and (2) analysing each ASR system’s performance against that difficulty level. For the first issue, the framework measures speech difficulty using Item Response Theory (IRT). For the second, the Recognizer Characteristic Curve (RCC) is proposed: a plot of the ASR system’s performance versus speech difficulty. ASR performance is further analysed in a two-dimensional plot, in which speech difficulty is decomposed by IRT into sentence difficulty and speaker quality. In the experiments, the proposed framework was applied to a test set produced with text-to-speech tools, covering diverse speakers and sentences. Additionally, noise injection was applied to produce test items with even higher difficulty levels. The results show that noise injection does increase difficulty and generates a wide variety of speeches for assessing ASR performance. However, high noise levels can lead to an unreliable evaluation. The proposed plots proved helpful both for identifying robust ASR systems and for choosing a noise level that balances diversity and reliability.
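The abstract does not specify which IRT parameterization the authors use, nor how the paper decomposes difficulty into sentence difficulty and speaker quality. As a rough illustration of the general idea, the sketch below fits a one-parameter (Rasch-style) IRT model to a binary response matrix (rows = ASR systems, columns = test speeches) and traces each system's Recognizer Characteristic Curve as predicted accuracy versus speech difficulty. The toy data, variable names, and fitting loop are assumptions for illustration, not the authors' implementation.

```python
# Minimal Rasch-style IRT sketch: estimate per-speech difficulty and
# per-system ability from binary "correct transcription" outcomes,
# then trace each system's RCC (predicted accuracy vs. difficulty).
import numpy as np

rng = np.random.default_rng(0)

# Toy responses: 1 if the system transcribed the speech correctly.
n_systems, n_items = 4, 60
true_ability = np.array([-1.0, 0.0, 0.5, 1.5])
true_difficulty = rng.normal(0.0, 1.0, n_items)
p_true = 1.0 / (1.0 + np.exp(-(true_ability[:, None] - true_difficulty[None, :])))
responses = rng.binomial(1, p_true)

# Joint maximum-likelihood fit by gradient ascent on the Rasch
# log-likelihood; difficulties are centred to fix the scale.
ability = np.zeros(n_systems)
difficulty = np.zeros(n_items)
lr = 0.05
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    resid = responses - p                      # gradient of the log-likelihood
    ability += lr * resid.sum(axis=1) / n_items
    difficulty -= lr * resid.sum(axis=0) / n_systems
    difficulty -= difficulty.mean()            # identifiability constraint

# RCC: predicted probability of a correct transcription as a function
# of speech difficulty, one curve per system.
grid = np.linspace(-3, 3, 7)
for i in range(n_systems):
    rcc = 1.0 / (1.0 + np.exp(-(ability[i] - grid)))
    print(f"system {i} (ability {ability[i]:+.2f}):",
          " ".join(f"{q:.2f}" for q in rcc))
```

Under this model, a more robust system's curve stays high further into the difficult region of the scale; the paper's two-dimensional plot additionally splits the difficulty axis into sentence and speaker components, which this one-dimensional sketch does not attempt.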
Journal Introduction
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.