Estimating band importance for environmental sound recognition using deep learning
Eric M. Johnson, Eric W. Healy
Journal of the Acoustical Society of America, 159(5), 3804-3818 (1 May 2026)
DOI: https://doi.org/10.1121/10.0043736
Citations: 0
Abstract
Environmental sound recognition (ESR) enables listeners to interpret complex acoustic environments, yet the frequency regions that support recognition are poorly understood. This study used deep learning to model ESR in competing speech and to estimate the frequency band-importance functions (BIFs) underlying recognition performance. Trial-level responses were collected from 46 listeners who identified 25 everyday sounds mixed with speech across a wide range of target-to-masker ratios. Two model variants were evaluated: one trained to mimic human performance, using soft labels derived from listener responses, and one trained for maximum accuracy, using ground-truth sound labels, enabling a direct comparison between perceptually driven and task-optimal band-importance patterns. The human-trained model closely reproduced key features of human performance, whereas the ground-truth-trained model exceeded human accuracy and showed highly reliable performance across cross-validation folds. BIFs were estimated by bandstop filtering the target signal and quantifying the resulting drop in recognition accuracy. Both model variants yielded reproducible BIFs with five prominent peaks (∼0.43, 0.77, 1.46, 2.6, and 9.7 kHz), largely driven by subsets of sounds having sharply tuned spectral dependence. This convergence across training objectives suggests that human performance closely reflects the task-optimal frequencies for segregating environmental sounds from speech maskers.
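The band-importance procedure described in the abstract (remove one frequency band, re-score the recognizer, and take the accuracy drop as that band's importance) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the Butterworth bandstop design, the band edges, and the `recognize` classifier interface are all assumptions introduced here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandstop(x, low_hz, high_hz, fs, order=4):
    """Attenuate energy between low_hz and high_hz with a Butterworth bandstop.

    Filter design (Butterworth, 4th order) is an illustrative choice.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, x)  # zero-phase filtering to avoid phase distortion

def band_importance(signals, labels, recognize, band, fs):
    """Estimate one band's importance as the drop in recognition accuracy.

    `recognize` is a hypothetical classifier mapping a waveform to a label;
    accuracy is scored with the band intact vs. removed.
    """
    base = np.mean([recognize(x) == y for x, y in zip(signals, labels)])
    removed = np.mean([recognize(bandstop(x, band[0], band[1], fs)) == y
                       for x, y in zip(signals, labels)])
    return base - removed
```

Sweeping `band` across contiguous frequency regions and plotting the resulting accuracy drops would trace out a band-importance function of the kind the study reports.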
About the journal
Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound, and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music, and noise; psychology and physiology of hearing; engineering acoustics and transduction; and bioacoustics, including animal bioacoustics.