Multi-label bird sound recognition based on multi-view learning and dynamic threshold adjustment

IF 3.4 · CAS Zone 2 (Physics and Astrophysics) · JCR Q1 (Acoustics)

Applied Acoustics · Pub Date: 2025-07-28 · DOI: 10.1016/j.apacoust.2025.110943

Minghui Fang, Dengwei Wu, Wei Wu, Mengfan Fang, Yanhong Chen, Xiangzeng Kong, Chen Zhao

Applied Acoustics, volume 240, Article 110943 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0003682X25004153

Citations: 0

Abstract


Multi-label bird sound recognition based on multi-view learning and dynamic threshold adjustment

Bird species monitoring is crucial for conservation, but overlapping vocalizations in natural environments complicate multi-label classification and degrade model performance. To address these issues, we developed the Adaptive Multi-label Attention Threshold Network (AMAT-Net), a bird sound classification framework. AMAT-Net employs a multi-view strategy that combines bidirectional gated recurrent unit (BiGRU)-attention networks, which analyze temporal features, with multi-scale convolutional neural networks, which extract spectral features. Time-domain features capture transient changes, whereas frequency-domain features reveal spectral trends, and balancing the essential features of both without losing details is difficult. We therefore designed the temporal–spectral attention feature fusion (TSAFF) module, which uses an attention-based mechanism to fuse temporal and spectral features and enhance cross-domain feature complementarity. Binary classification is conducted between relevant and irrelevant labels, and a threshold is determined based on the results. A score-based thresholding strategy called dynamic threshold scaling was then developed. A label correlation matrix is constructed using Pearson's correlation coefficients, and the classifier's scores for instance-label pairs with high inter-label correlations are adjusted accordingly during prediction. In addition, hierarchical cross-validation is used to search for the threshold that maximizes the F1 score, dynamically optimizing the decision boundary for each species to adapt to the actual label distribution.
Experimental results on a synthesized dataset of 10 bird species (including cases of 2, 3, and 4 species vocalizing simultaneously) and the public BirdCLEF+2025 dataset demonstrate that AMAT-Net achieves an accuracy of 95.54% with a macro-F1 score of 91.26% on the synthesized dataset, and an accuracy of 98.75% with a macro-F1 score of 93.14% on BirdCLEF+2025.
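For readers unfamiliar with the reported metrics, the following sketch shows how macro-F1 is computed for multi-label predictions. The abstract does not say which definition of accuracy was used, so the sketch shows two common candidates (per-label accuracy and exact-match subset accuracy) as assumptions; the toy arrays are made up for illustration and are unrelated to the paper's data.

```python
import numpy as np

def macro_f1(Y_true, Y_pred):
    """Unweighted mean of the per-label F1 scores."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    tp = (Y_true & Y_pred).sum(axis=0)
    fp = (~Y_true & Y_pred).sum(axis=0)
    fn = (Y_true & ~Y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Labels with no true or predicted positives get F1 = 0 by convention here.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())

def label_accuracy(Y_true, Y_pred):
    """Fraction of individual instance-label decisions that are correct."""
    return float((np.asarray(Y_true) == np.asarray(Y_pred)).mean())

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose whole label set is predicted exactly."""
    return float((np.asarray(Y_true) == np.asarray(Y_pred)).all(axis=1).mean())

# Toy example: 4 recordings, 3 species.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

print(round(macro_f1(Y_true, Y_pred), 4))        # 0.7778  (per-label F1: 1, 2/3, 2/3)
print(round(label_accuracy(Y_true, Y_pred), 4))  # 0.8333  (10 of 12 decisions correct)
print(subset_accuracy(Y_true, Y_pred))           # 0.5     (2 of 4 recordings exact)
```

Because macro-F1 weights every species equally, it is typically lower than accuracy when some species are rare, which is consistent with the gap between the two metrics reported above.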


Source journal

Applied Acoustics (Physics: Acoustics)

CiteScore: 7.40
Self-citation rate: 11.80%
Articles published: 618
Review time: 7.5 months

About the journal: Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense. Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience through complete papers, short technical notes, and review articles, and thereby provides a wealth of technological information that can be used to solve related problems. Manuscripts that address all fields of applications of acoustics, ranging from medicine and NDT to the environment and buildings, are welcome.