{"title":"Evaluating metric and contrastive learning in pretrained models for environmental sound classification","authors":"Feilong Chen , Zhenjun Zhu , Chengli Sun , Linqing Xia","doi":"10.1016/j.apacoust.2025.110593","DOIUrl":null,"url":null,"abstract":"<div><div>Environmental Sound Classification (ESC) has advanced significantly with the advent of deep learning techniques. This study conducts a comprehensive evaluation of contrastive and metric learning approaches in ESC, introducing the ESC51 dataset, an extension of the ESC50 benchmark that incorporates noise samples from quadrotor Unmanned Aerial Vehicles (UAVs). To enhance classification performance and the discriminative power of embedding spaces, we propose a novel metric learning-based approach, SoundMLR, which employs a hybrid loss function emphasizing metric learning principles. Experimental results demonstrate that SoundMLR consistently outperforms contrastive learning methods in terms of classification accuracy and inference latency, particularly when applied to the lightweight MobileNetV2 pretrained model across ESC50, ESC51, and UrbanSound8K (US8K) datasets. Analyses of confusion matrices and t-SNE visualizations further highlight SoundMLR’s ability to generate compact, distinct feature clusters, enabling more robust discrimination between sound classes. Additionally, we introduce two innovative modules, Spectral Pooling Attention (SPA) and the Feature Pooling Layer (FPL), designed to optimize the MobileNetV2 backbone. Notably, the MobileNetV2 + FPL model, equipped with SoundMLR, achieves an impressive 92.16 % classification accuracy on the ESC51 dataset while reducing computational complexity by 24.5 %. Similarly, the MobileNetV2 + SPA model achieves a peak accuracy of 91.75 % on the ESC50 dataset, showcasing the complementary strengths of these modules. These findings offer valuable insights for the future development of efficient, scalable, and robust ESC systems. The source code for this study is publicly available at <span><span>https://github.com/flchenwhu/ESC-SoundMLR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":55506,"journal":{"name":"Applied Acoustics","volume":"232 ","pages":"Article 110593"},"PeriodicalIF":3.4000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Acoustics","FirstCategoryId":"101","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0003682X25000659","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0
Abstract
Environmental Sound Classification (ESC) has advanced significantly with the advent of deep learning techniques. This study conducts a comprehensive evaluation of contrastive and metric learning approaches in ESC, introducing the ESC51 dataset, an extension of the ESC50 benchmark that incorporates noise samples from quadrotor Unmanned Aerial Vehicles (UAVs). To enhance classification performance and the discriminative power of embedding spaces, we propose a novel metric learning-based approach, SoundMLR, which employs a hybrid loss function emphasizing metric learning principles. Experimental results demonstrate that SoundMLR consistently outperforms contrastive learning methods in terms of classification accuracy and inference latency, particularly when applied to the lightweight MobileNetV2 pretrained model across ESC50, ESC51, and UrbanSound8K (US8K) datasets. Analyses of confusion matrices and t-SNE visualizations further highlight SoundMLR’s ability to generate compact, distinct feature clusters, enabling more robust discrimination between sound classes. Additionally, we introduce two innovative modules, Spectral Pooling Attention (SPA) and the Feature Pooling Layer (FPL), designed to optimize the MobileNetV2 backbone. Notably, the MobileNetV2 + FPL model, equipped with SoundMLR, achieves an impressive 92.16 % classification accuracy on the ESC51 dataset while reducing computational complexity by 24.5 %. Similarly, the MobileNetV2 + SPA model achieves a peak accuracy of 91.75 % on the ESC50 dataset, showcasing the complementary strengths of these modules. These findings offer valuable insights for the future development of efficient, scalable, and robust ESC systems. The source code for this study is publicly available at https://github.com/flchenwhu/ESC-SoundMLR.
期刊介绍:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways: • Complete Papers • Short Technical Notes • Review Articles; and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.