Multi-label bird sound recognition based on multi-view learning and dynamic threshold adjustment
Authors: Minghui Fang, Dengwei Wu, Wei Wu, Mengfan Fang, Yanhong Chen, Xiangzeng Kong, Chen Zhao
Journal: Applied Acoustics, volume 240, Article 110943 (impact factor 3.4; JCR Q1, Acoustics)
DOI: 10.1016/j.apacoust.2025.110943
Published: 2025-07-28
URL: https://www.sciencedirect.com/science/article/pii/S0003682X25004153
Bird species monitoring is crucial for conservation, but overlapping vocalizations in natural environments complicate multi-label classification and degrade model performance. To address this, we developed the Adaptive Multi-label Attention Threshold Network (AMAT-Net), a bird sound classification framework. AMAT-Net employs a multi-view strategy that combines bidirectional gated recurrent unit (BiGRU)-attention networks, which analyze temporal features, with multi-scale convolutional neural networks, which extract spectral features. Time-domain features capture transient changes, whereas frequency-domain features reveal spectral trends, and balancing the essential features of both without losing detail is difficult. We therefore designed the temporal–spectral attention feature fusion (TSAFF) module, which uses an attention-based mechanism to fuse temporal and spectral features and enhance cross-domain feature complementarity. Binary classification is conducted between relevant and irrelevant labels, and a threshold is determined based on the results. A score-based thresholding strategy called dynamic threshold scaling was then developed. A label correlation matrix is constructed using Pearson's correlation coefficients, and the classifier's scores for instance-label pairs with high inter-label correlations are adjusted accordingly during prediction. In addition, hierarchical cross-validation is used to search for the threshold that maximizes the F1 score, dynamically optimizing the decision boundary for each species to adapt to the actual label distribution.
Experiments were run on a synthesized dataset of 10 bird species (including cases of 2, 3, and 4 species vocalizing simultaneously) and on the public BirdCLEF+2025 dataset. AMAT-Net achieves an accuracy of 95.54% and a macro-F1 score of 91.26% on the synthesized dataset, and an accuracy of 98.75% and a macro-F1 score of 93.14% on BirdCLEF+2025.
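The thresholding steps described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not give the score-adjustment formula, so `adjust_scores`, its `min_corr` and `alpha` parameters, and the blending rule are hypothetical stand-ins. The threshold search here is a plain grid search on one held-out set, not the hierarchical cross-validation the paper reports.

```python
from math import sqrt


def pearson_label_correlation(labels):
    """Pearson correlation between label columns of a binary matrix (rows = clips)."""
    n, k = len(labels), len(labels[0])
    means = [sum(row[j] for row in labels) / n for j in range(k)]
    corr = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            cov = sum((r[a] - means[a]) * (r[b] - means[b]) for r in labels)
            var_a = sum((r[a] - means[a]) ** 2 for r in labels)
            var_b = sum((r[b] - means[b]) ** 2 for r in labels)
            denom = sqrt(var_a * var_b)
            # A label that never varies has undefined correlation; use 0.0.
            corr[a][b] = cov / denom if denom > 0 else 0.0
    return corr


def adjust_scores(scores, corr, min_corr=0.5, alpha=0.2):
    """Hypothetical rule (an assumption, not from the paper): blend each label's
    score with the correlation-weighted scores of its highly correlated labels."""
    k = len(scores)
    adjusted = []
    for j in range(k):
        partners = [i for i in range(k) if i != j and corr[j][i] >= min_corr]
        if partners:
            support = sum(corr[j][i] * scores[i] for i in partners) / len(partners)
            s = (1 - alpha) * scores[j] + alpha * support
        else:
            s = scores[j]
        adjusted.append(min(1.0, max(0.0, s)))  # keep scores in [0, 1]
    return adjusted


def f1_score(y_true, y_pred):
    """F1 for one label from binary ground truth and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_score, grid=None):
    """Return (threshold, F1) for the grid value that maximizes F1 on one label.
    Ties keep the lowest threshold."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        f1 = f1_score(y_true, [1 if s >= t else 0 for s in y_score])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    # Toy data: labels 0 and 1 always co-occur; label 2 never occurs with them.
    train_labels = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
    corr = pearson_label_correlation(train_labels)
    print(adjust_scores([0.9, 0.4, 0.1], corr))
    print(best_threshold([1, 1, 0, 0], [0.9, 0.6, 0.55, 0.2]))
```

Running `best_threshold` once per species on validation scores gives one decision threshold per label, which is the sense in which the decision boundary adapts to each species' label distribution.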
About the journal:
Since its launch in 1968, Applied Acoustics has published high-quality research papers that provide state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it should appear only as an integral part of a practical solution to a problem and be supported by data. The journal encourages the exchange of practical experience through the following publication types, thereby building a fund of technological information that can be used to solve related problems:
• Complete Papers
• Short Technical Notes
• Review Articles
Manuscripts addressing any field of application of acoustics, from medicine and non-destructive testing (NDT) to the environment and buildings, are welcome.