Ayden M. Cauchi, Jaina Negandhi, Sharon L. Cushing, Karen A. Gordon
Automatic speech recognition technology to evaluate an audiometric word recognition test: A preliminary investigation

Journal: Speech Communication, Volume 173, Article 103270 (Q2, Acoustics)
DOI: 10.1016/j.specom.2025.103270
Published: 2025-06-20
URL: https://www.sciencedirect.com/science/article/pii/S0167639325000858
Citations: 0
Abstract
This study investigated the ability of machine learning systems to score a clinical speech perception test in which monosyllabic words are heard and repeated by a listener. The accuracy score is used in audiometric assessments, including cochlear implant candidacy and monitoring. Scoring is performed by clinicians who listen to and judge responses, which can introduce inter-rater variability and consumes clinical time. A machine learning approach could support this testing by providing increased reliability and time efficiency, particularly in children. This study focused on the Phonetically Balanced Kindergarten (PBK) word list. Spoken responses (n=1200) were recorded from 12 adults with normal hearing. These words were presented to 3 automatic speech recognizers (Whisper large, Whisper medium, Ursa) and 7 humans in 7 conditions: unaltered or, to simulate potential speech errors, altered by first or last consonant deletion or low-pass filtering at 1, 2, 4, and 6 kHz (n=6972 altered responses). Responses were scored as the same or different from the unaltered target. Data revealed that automatic speech recognizers (ASRs) correctly classified unaltered words similarly to human evaluators across conditions [mean ± 1 SE: Whisper large = 88.20 % ± 1.52 %; Whisper medium = 81.20 % ± 1.52 %; Ursa = 90.70 % ± 1.52 %; humans = 91.80 % ± 2.16 %], [F(3, 3866.2) = 23.63, p<0.001]. Classifications different from the unaltered target occurred most frequently for the first consonant deletion and 1 kHz filtering conditions. Fleiss' kappa metrics showed that ASRs displayed higher agreement than human evaluators across unaltered (ASRs = 0.69; humans = 0.17) and altered (ASRs = 0.56; humans = 0.51) PBK words. These results support the further development of automatic speech recognition systems to support speech perception testing.
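The agreement values reported in the abstract (e.g., ASRs = 0.69 on unaltered words) are Fleiss' kappa scores over same/different judgements. A minimal sketch of how that statistic is computed follows; the rater counts below are invented for illustration and only the formula follows the standard definition:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multi-rater categorical agreement.

    counts[i][j] = number of raters assigning item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Mean per-item agreement: P_i = (sum_j n_ij^2 - n) / (n(n-1))
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Four hypothetical raters judge three spoken responses as
# "same" (column 0) or "different" (column 1) from the target word.
unanimous = [[4, 0], [0, 4], [4, 0]]
split     = [[2, 2], [2, 2], [2, 2]]

print(round(fleiss_kappa(unanimous), 2))  # → 1.0 (perfect agreement)
print(round(fleiss_kappa(split), 2))      # → -0.33 (below-chance agreement)
```

Kappa of 1.0 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement, which is why the humans' 0.17 on unaltered words signals substantially weaker consistency than the ASRs' 0.69.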
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.