Estimating band importance for environmental sound recognition using deep learning

IF 2.3 · CAS Zone 2 (Physics & Astronomy) · JCR Q2 (Acoustics)
Eric M Johnson, Eric W Healy
Journal of the Acoustical Society of America, Vol. 159, No. 5, pp. 3804-3818. Published 2026-05-01. DOI: 10.1121/10.0043736. Citations: 0.

Abstract

Environmental sound recognition (ESR) enables listeners to interpret complex acoustic environments, yet the frequency regions that support recognition are poorly understood. This study used deep learning to model ESR in competing speech and estimate frequency band-importance functions (BIFs) underlying recognition performance. Trial-level responses were collected from 46 listeners who identified 25 everyday sounds mixed with speech across a wide range of target-to-masker ratios. Two model variants were evaluated: one trained to mimic human performance, which was trained on soft labels derived from listener responses, and one trained for maximum accuracy, which was trained on ground-truth correct sound labels, enabling a direct comparison between perceptually driven and task-optimal band-importance patterns. The human-trained model closely reproduced key features of human performance, whereas the ground-truth-trained model exceeded human accuracy and showed highly reliable performance across cross-validation folds. BIFs were estimated by bandstop filtering the target signal and quantifying the resulting drop in recognition accuracy. Both model variants yielded reproducible BIFs with five prominent peaks (∼0.43, 0.77, 1.46, 2.6, and 9.7 kHz), largely driven by subsets of sounds having sharply tuned spectral dependence. This convergence across training objectives suggests that human performance closely reflects the task-optimal frequencies for segregating environmental sounds from speech maskers.
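The band-importance procedure described above (bandstop filtering the target and measuring the resulting drop in recognition accuracy) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the notch filter, function names, and the toy "recognizer" (a simple RMS score) are all assumptions made for demonstration.

```python
import numpy as np

def bandstop_fft(signal, low_hz, high_hz, fs):
    """Zero out spectral components between low_hz and high_hz
    (a crude ideal notch; the paper's actual filter design may differ)."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs >= low_hz) & (freqs <= high_hz)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def band_importance(accuracy_fn, signals, band, fs):
    """Importance of `band` = drop in accuracy when that band is notched out."""
    full = accuracy_fn(signals)
    notched = accuracy_fn([bandstop_fft(x, band[0], band[1], fs)
                           for x in signals])
    return full - notched

# Toy demonstration: a 1 kHz tone and a stand-in "recognizer" that simply
# scores surviving signal energy (RMS). Notching 0.8-1.2 kHz removes the
# tone entirely, so that band shows high importance.
fs = 16000
t = np.arange(fs) / fs
tones = [np.sin(2 * np.pi * 1000 * t)]

def toy_accuracy(sigs):
    return float(np.mean([np.sqrt(np.mean(s ** 2)) for s in sigs]))

importance = band_importance(toy_accuracy, tones, (800, 1200), fs)
```

In the study itself, `accuracy_fn` would be the trained deep-learning recognizer's accuracy on the filtered target-plus-speech mixtures, and the notch would be swept across frequency bands to trace out the full BIF.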

Source journal: Journal of the Acoustical Society of America
CiteScore: 4.60
Self-citation rate: 16.70%
Annual publications: 1433
Review time: 4.7 months
About the journal: Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound, and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music, and noise; psychology and physiology of hearing; engineering acoustics and transduction; and bioacoustics, including animal bioacoustics.