Estimating band importance for environmental sound recognition using deep learning

IF 2.3 · CAS Zone 2 (Physics & Astronomy) · JCR Q2 (Acoustics)
Eric M Johnson, Eric W Healy
Journal of the Acoustical Society of America, Vol. 159, No. 5, pp. 3804-3818. Published 2026-05-01. DOI: 10.1121/10.0043736. Citations: 0.

Abstract

Environmental sound recognition (ESR) enables listeners to interpret complex acoustic environments, yet the frequency regions that support recognition are poorly understood. This study used deep learning to model ESR in competing speech and estimate frequency band-importance functions (BIFs) underlying recognition performance. Trial-level responses were collected from 46 listeners who identified 25 everyday sounds mixed with speech across a wide range of target-to-masker ratios. Two model variants were evaluated: one trained to mimic human performance, which was trained on soft labels derived from listener responses, and one trained for maximum accuracy, which was trained on ground-truth correct sound labels, enabling a direct comparison between perceptually driven and task-optimal band-importance patterns. The human-trained model closely reproduced key features of human performance, whereas the ground-truth-trained model exceeded human accuracy and showed highly reliable performance across cross-validation folds. BIFs were estimated by bandstop filtering the target signal and quantifying the resulting drop in recognition accuracy. Both model variants yielded reproducible BIFs with five prominent peaks (∼0.43, 0.77, 1.46, 2.6, and 9.7 kHz), largely driven by subsets of sounds having sharply tuned spectral dependence. This convergence across training objectives suggests that human performance closely reflects the task-optimal frequencies for segregating environmental sounds from speech maskers.
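The band-importance procedure described above (bandstop filtering the target and measuring the resulting drop in recognition accuracy) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the notch filter, function names, and the toy "recognizer" (a simple RMS score) are all assumptions made for demonstration.

```python
import numpy as np

def bandstop_fft(signal, low_hz, high_hz, fs):
    """Zero out spectral components between low_hz and high_hz
    (a crude ideal notch; the paper's actual filter design may differ)."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spec[(freqs >= low_hz) & (freqs <= high_hz)] = 0.0
    return np.fft.irfft(spec, n=len(signal))

def band_importance(accuracy_fn, signals, band, fs):
    """Importance of `band` = drop in accuracy when that band is notched out."""
    full = accuracy_fn(signals)
    notched = accuracy_fn([bandstop_fft(x, band[0], band[1], fs)
                           for x in signals])
    return full - notched

# Toy demonstration: a 1 kHz tone and a stand-in "recognizer" that simply
# scores surviving signal energy (RMS). Notching 0.8-1.2 kHz removes the
# tone entirely, so that band shows high importance.
fs = 16000
t = np.arange(fs) / fs
tones = [np.sin(2 * np.pi * 1000 * t)]

def toy_accuracy(sigs):
    return float(np.mean([np.sqrt(np.mean(s ** 2)) for s in sigs]))

importance = band_importance(toy_accuracy, tones, (800, 1200), fs)
```

In the study itself, `accuracy_fn` would be the trained deep-learning recognizer's accuracy on the filtered target-plus-speech mixtures, and the notch would be swept across frequency bands to trace out the full BIF.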

Source journal: Journal of the Acoustical Society of America
CiteScore: 4.60
Self-citation rate: 16.70%
Annual publications: 1433
Review time: 4.7 months
About the journal: Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound, and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music, and noise; psychology and physiology of hearing; engineering acoustics and transduction; and bioacoustics, including animal bioacoustics.