Sounds of the deep: How input representation, model choice, and dataset size influence underwater sound classification performance.

IF 2.3 · Zone 2, Physics and Astrophysics · JCR Q2 ACOUSTICS

Journal of the Acoustical Society of America Pub Date : 2025-04-01 DOI:10.1121/10.0036498

Abdullah Olcay, Paul R White, Jonathan M Bull, Denise Risch, Benedict Dell, Ellen L White

Volume 157, Issue 4, pages 3017-3032

Citations: 0

Abstract

Convolutional neural networks (CNNs) have proven highly effective in automatically identifying and classifying underwater sound sources, enabling efficient analysis of marine environments. This work examines two key design choices for a CNN classifier, input representation and network architecture, analyzing their importance as training data size varies and their effectiveness in generalizing between sites. Passive acoustic data from three offshore sites in Western Scotland were used for hierarchical classification, categorizing sounds into one of four classes: delphinid tonal, delphinid clicks, vessels, and ambient noise. Three different input representations of the acoustic signals were investigated along with four CNN architectures, including three pre-trained for image classification tasks. Experiments show that a custom-built shallow CNN can outperform more complex architectures if the input representation is chosen appropriately. For example, a shallow CNN using a Mel-spectrogram normalized with per-channel energy normalization (MS-PCEN) achieved a 12.5% accuracy improvement over a ResNet model when only small amounts of training data were available. Studying model performance across the three sites demonstrates that input representation is an important factor for achieving robust results between sites, with MS-PCEN achieving the best performance. However, the importance of the choice of input representation decreases as the training dataset size increases.
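To make the MS-PCEN representation more concrete, the following is a minimal sketch of per-channel energy normalization in its commonly published form: each frequency band is divided by a smoothed version of its own recent energy, then passed through a root compression. It is not the authors' implementation. The function name `pcen`, the parameter values (`s`, `alpha`, `delta`, `r`, `eps`), and the random array standing in for a Mel-spectrogram are all assumptions chosen for illustration; the abstract does not state the settings used in the paper.

```python
import numpy as np

def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (n_bands, n_frames) array.

    All default parameter values are illustrative assumptions,
    not the settings used in the paper.
    """
    energy = np.asarray(energy, dtype=float)
    smoothed = np.empty_like(energy)
    # Start the per-band smoother from the first frame.
    smoothed[:, 0] = energy[:, 0]
    # First-order recursive low-pass filter along time, one per band.
    for t in range(1, energy.shape[1]):
        smoothed[:, t] = (1.0 - s) * smoothed[:, t - 1] + s * energy[:, t]
    # Adaptive gain control: divide each band by its smoothed energy.
    gain = (eps + smoothed) ** (-alpha)
    # Root compression with an offset so that zero energy maps to zero.
    return (energy * gain + delta) ** r - delta ** r

# Hypothetical stand-in for Mel-band energies (64 bands, 200 frames).
rng = np.random.default_rng(0)
mel_energy = rng.random((64, 200))

quiet = pcen(mel_energy)
loud = pcen(100.0 * mel_energy)  # same signal, 100 times more energy
```

Because the gain tracks the level of each band, `quiet` and `loud` end up on a similar scale even though the inputs differ by a factor of 100. That insensitivity to overall level is one plausible reason such a representation could transfer well between recording sites, although the abstract itself does not give a mechanism.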


Source journal

Journal of the Acoustical Society of America (Physics: Acoustics)

CiteScore: 4.60

Self-citation rate: 16.70%

Articles published: 1433

Review time: 4.7 months

About the journal: Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music and noise; psychology and physiology of hearing; engineering acoustics, transduction; bioacoustics, animal bioacoustics.