Ayden M. Cauchi, Jaina Negandhi, Sharon L. Cushing, Karen A. Gordon
Automatic speech recognition technology to evaluate an audiometric word recognition test: A preliminary investigation

Journal: Speech Communication, Volume 173, Article 103270 (Q2, Acoustics)
DOI: 10.1016/j.specom.2025.103270
Published: 2025-06-20
URL: https://www.sciencedirect.com/science/article/pii/S0167639325000858
Citations: 0
Abstract
This study investigated the ability of machine learning systems to score a clinical speech perception test in which monosyllabic words are heard and repeated by a listener. The accuracy score is used in audiometric assessments, including cochlear implant candidacy and monitoring. Scoring is performed by clinicians who listen to and judge responses, which can introduce inter-rater variability and consumes clinical time. A machine learning approach could support this testing by providing increased reliability and time efficiency, particularly in children. This study focused on the Phonetically Balanced Kindergarten (PBK) word list. Spoken responses (n=1200) were recorded from 12 adults with normal hearing. These words were presented to 3 automatic speech recognizers (Whisper large, Whisper medium, Ursa) and 7 humans in 7 conditions: unaltered or, to simulate potential speech errors, altered by first or last consonant deletion or low-pass filtering at 1, 2, 4, and 6 kHz (n=6972 altered responses). Responses were scored as the same or different from the unaltered target. Data revealed that automatic speech recognizers (ASRs) correctly classified unaltered words similarly to human evaluators across conditions [mean ± 1 SE: Whisper large = 88.20 % ± 1.52 %; Whisper medium = 81.20 % ± 1.52 %; Ursa = 90.70 % ± 1.52 %; humans = 91.80 % ± 2.16 %], [F(3, 3866.2) = 23.63, p<0.001]. Classifications different from the unaltered target occurred most frequently for the first consonant deletion and 1 kHz filtering conditions. Fleiss' kappa metrics showed that ASRs displayed higher agreement than human evaluators across unaltered (ASRs = 0.69; humans = 0.17) and altered (ASRs = 0.56; humans = 0.51) PBK words. These results support the further development of automatic speech recognition systems to support speech perception testing.
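The agreement values reported in the abstract (e.g., ASRs = 0.69 on unaltered words) are Fleiss' kappa scores over same/different judgements. A minimal sketch of how that statistic is computed follows; the rater counts below are invented for illustration and only the formula follows the standard definition:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multi-rater categorical agreement.

    counts[i][j] = number of raters assigning item i to category j.
    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Mean per-item agreement: P_i = (sum_j n_ij^2 - n) / (n(n-1))
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Four hypothetical raters judge three spoken responses as
# "same" (column 0) or "different" (column 1) from the target word.
unanimous = [[4, 0], [0, 4], [4, 0]]
split     = [[2, 2], [2, 2], [2, 2]]

print(round(fleiss_kappa(unanimous), 2))  # → 1.0 (perfect agreement)
print(round(fleiss_kappa(split), 2))      # → -0.33 (below-chance agreement)
```

Kappa of 1.0 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement, which is why the humans' 0.17 on unaltered words signals substantially weaker consistency than the ASRs' 0.69.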
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.