Evaluating metric and contrastive learning in pretrained models for environmental sound classification

IF 3.4 | Zone 2, Physics and Astrophysics | JCR Q1 ACOUSTICS

Applied Acoustics Pub Date : 2025-02-11 DOI:10.1016/j.apacoust.2025.110593

Feilong Chen, Zhenjun Zhu, Chengli Sun, Linqing Xia

Applied Acoustics, Volume 232, Article 110593. Full text: https://www.sciencedirect.com/science/article/pii/S0003682X25000659

Citations: 0

Abstract

Evaluating metric and contrastive learning in pretrained models for environmental sound classification

Environmental Sound Classification (ESC) has advanced significantly with the advent of deep learning techniques. This study conducts a comprehensive evaluation of contrastive and metric learning approaches in ESC, introducing the ESC51 dataset, an extension of the ESC50 benchmark that incorporates noise samples from quadrotor Unmanned Aerial Vehicles (UAVs). To enhance classification performance and the discriminative power of embedding spaces, we propose a novel metric learning-based approach, SoundMLR, which employs a hybrid loss function emphasizing metric learning principles. Experimental results demonstrate that SoundMLR consistently outperforms contrastive learning methods in terms of classification accuracy and inference latency, particularly when applied to the lightweight MobileNetV2 pretrained model across ESC50, ESC51, and UrbanSound8K (US8K) datasets. Analyses of confusion matrices and t-SNE visualizations further highlight SoundMLR’s ability to generate compact, distinct feature clusters, enabling more robust discrimination between sound classes. Additionally, we introduce two innovative modules, Spectral Pooling Attention (SPA) and the Feature Pooling Layer (FPL), designed to optimize the MobileNetV2 backbone. Notably, the MobileNetV2 + FPL model, equipped with SoundMLR, achieves an impressive 92.16 % classification accuracy on the ESC51 dataset while reducing computational complexity by 24.5 %. Similarly, the MobileNetV2 + SPA model achieves a peak accuracy of 91.75 % on the ESC50 dataset, showcasing the complementary strengths of these modules. These findings offer valuable insights for the future development of efficient, scalable, and robust ESC systems. The source code for this study is publicly available at https://github.com/flchenwhu/ESC-SoundMLR.
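The abstract describes SoundMLR's hybrid loss only as "emphasizing metric learning principles" and does not list its terms. As a minimal sketch of what such a loss could look like, the example below assumes a weighted sum of a triplet margin loss (the metric term) and a softmax cross-entropy (the classification term). The function names, the `metric_weight` of 0.7, and the `margin` of 1.0 are all hypothetical choices for illustration, not values taken from the paper or its repository.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample from raw class scores (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


def hybrid_loss(logits, label, anchor, positive, negative,
                metric_weight=0.7, margin=1.0):
    """Assumed hybrid: metric_weight > 0.5 makes the metric term dominate."""
    ce = softmax_cross_entropy(logits, label)
    tri = triplet_margin(anchor, positive, negative, margin)
    return metric_weight * tri + (1.0 - metric_weight) * ce


# Hypothetical 2-D embeddings: the positive shares the anchor's class,
# the negative belongs to a different class.
loss = hybrid_loss(
    logits=[2.0, 0.0, 0.0], label=0,
    anchor=[0.0, 0.0], positive=[0.3, 0.4], negative=[0.6, 0.8],
)
print(f"hybrid loss: {loss:.4f}")
```

In this sketch the triplet term falls to zero once the negative is at least `margin` farther from the anchor than the positive. That is one way a loss can push embeddings toward the compact, well-separated clusters the abstract reports in its t-SNE analysis.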

Source journal

Applied Acoustics (Physics: Acoustics)

CiteScore: 7.40

Self-citation rate: 11.80%

Articles published: 618

Review time: 7.5 months

About the journal: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience through Complete Papers, Short Technical Notes, and Review Articles, and thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics, ranging from medicine and NDT to the environment and buildings, are welcome.