Title: DFEFM: Fusing frequency correlation and mel features for robust edge bird audio detection
Authors: Yingqi Wang, Luyang Zhang, Jiangjian Xie, Junguo Zhang, Rui Zhu
DOI: 10.1016/j.avrs.2025.100232
Journal: Avian Research, Vol. 16, No. 2, Article 100232 (JCR Q1, Ornithology; Impact Factor 1.6)
Published: 2025-02-25 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S2053716625000118
Citations: 0
Abstract
Passive acoustic monitoring (PAM) technology is increasingly becoming one of the mainstream methods for bird monitoring. However, detecting bird audio within complex natural acoustic environments using PAM devices remains a significant challenge. To enhance the accuracy (ACC) of bird audio detection (BAD) and reduce both false negatives and false positives, this study proposes a BAD method based on a Dual-Feature Enhancement Fusion Model (DFEFM). The method applies per-channel energy normalization (PCEN) to suppress noise in the input audio and uses mel-frequency cepstral coefficients (MFCC) and frequency correlation matrices (FCM) as input features. It performs deep feature-level fusion of MFCC and FCM along the channel dimension through two independent multi-layer convolutional network branches, and further integrates Spatial and Channel Synergistic Attention (SCSA) and Multi-Head Attention (MHA) modules to strengthen the fusion of these two deep features. Experimental results on the DCASE2018 BAD dataset show that the proposed method achieved an ACC of 91.4% and an AUC of 0.963, with false negative and false positive rates of 11.36% and 7.40%, respectively, surpassing existing methods. The method also achieved detection ACC above 92% and AUC values above 0.987 on datasets from three sites in different natural settings in Beijing. On an NVIDIA Jetson Nano, the method reached an ACC of 89.48% on audio clips averaging 10 s in length, with a response time of only 0.557 s, demonstrating excellent processing efficiency. This study provides an effective method for filtering out non-bird audio in bird vocalization monitoring devices, which helps reduce edge storage and data transmission costs and has significant application value for wild bird monitoring and ecological research.
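To make the feature pipeline concrete, the sketch below shows, in plain numpy, the three ingredients the abstract names: PCEN (using the standard formulation from Wang et al., 2017, with an IIR-smoothed energy estimate), a frequency correlation matrix built from pairwise Pearson correlations of spectrogram frequency bins, and a toy channel-dimension fusion step. This is an illustrative reconstruction, not the authors' code: the function names, parameter defaults, and the softmax-gate fusion are assumptions, MFCC extraction (typically done with a library such as librosa) is omitted, and the real DFEFM uses learned convolutional branches with SCSA/MHA attention rather than the hand-rolled reweighting shown here.

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization (PCEN, Wang et al. 2017).

    E: non-negative (mel) energy spectrogram, shape (n_bands, n_frames).
    A first-order IIR filter tracks a smoothed energy M per band; each
    band is normalized by M**alpha, then compressed by the (delta, r) root.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1.0 - s) * M[:, t - 1] + s * E[:, t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

def frequency_correlation_matrix(S):
    """FCM: Pearson correlation between every pair of frequency-bin time
    series of a spectrogram S (n_bands, n_frames) -> (n_bands, n_bands).
    Harmonically structured bird calls yield strong off-diagonal entries.
    """
    return np.corrcoef(S)

def fuse_channels(F1, F2):
    """Toy channel-dimension fusion: concatenate two deep feature maps of
    shapes (C1, H, W) and (C2, H, W) along the channel axis, then reweight
    channels with softmax gates from global average pooling -- a crude
    stand-in for the learned SCSA/MHA modules described in the paper.
    """
    X = np.concatenate([F1, F2], axis=0)   # (C1 + C2, H, W)
    gates = X.mean(axis=(1, 2))            # one descriptor per channel
    gates = np.exp(gates - gates.max())    # numerically stable softmax
    gates /= gates.sum()
    return X * gates[:, None, None]
```

In practice the PCEN smoothing coefficient `s` trades off noise suppression against responsiveness to call onsets, and the FCM is fed to its own convolutional branch alongside the MFCC branch before fusion.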
About the journal:
Avian Research is an open access, peer-reviewed journal publishing high-quality research and review articles on all aspects of ornithology from around the world. It aims to report the latest and most significant progress in ornithology and to encourage the exchange of ideas among international ornithologists. As an open access journal, Avian Research provides a unique opportunity to publish high-quality content that is internationally accessible to any reader at no cost.