{"title":"Enhancing music audio signal recognition through CNN-BiLSTM fusion with De-noising autoencoder for improved performance","authors":"Xiaoying Mao , Ye Tian , Tairan Jin , Bo Di","doi":"10.1016/j.neucom.2025.129607","DOIUrl":null,"url":null,"abstract":"<div><div>This study presents an advanced framework for music audio signal recognition that combines Convolutional Neural Networks (CNNs), Bidirectional Long Short-Term Memory (BiLSTM) networks, and Noise Reduction Auto-encoder models to significantly improve accuracy and robustness. The core innovation is a novel noise reduction auto-encoder that integrates CNN and BiLSTM architectures, enabling superior recognition performance under varying noise levels and environmental conditions. The proposed framework, validated on several datasets including the Zhvoice, Common Voice, and LibriSpeech, demonstrates higher accuracy compared to existing methods. In addition, an optimized CNN architecture called Faster Region-based CNN with Multi-scale Information (FRCNN-MSI) is developed for efficient speech feature extraction, which shows significant improvements in noisy environments. The BiLSTM model is further enhanced with an attention mechanism that improves sequence modeling and contextual relationship capture. 
Together, these advances establish our approach as a robust solution to real-world speech recognition challenges, with potential implications for improving speech recognition systems in diverse applications.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"625 ","pages":"Article 129607"},"PeriodicalIF":5.5000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225002796","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
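The core component described in the abstract is a noise-reduction (de-noising) autoencoder: a network trained to map a noise-corrupted input back to its clean original. The sketch below illustrates that general principle only, on toy 1-D sinusoidal signals with a single dense hidden layer in NumPy; it is a simplified stand-in, not the authors' CNN-BiLSTM architecture, and all names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: clean 1-D sinusoidal "signals" plus additive Gaussian noise.
n_samples, n_features, n_hidden = 256, 32, 8
t = np.linspace(0.0, 1.0, n_features)
freqs = rng.uniform(1.0, 4.0, n_samples)
clean = np.sin(2.0 * np.pi * freqs[:, None] * t[None, :])
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# One-hidden-layer denoising autoencoder: map NOISY input to the CLEAN target.
W1 = 0.1 * rng.standard_normal((n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_hidden, n_features))
b2 = np.zeros(n_features)

lr, losses = 0.1, []
for _ in range(500):
    h = np.tanh(noisy @ W1 + b1)          # encoder: compress to a small code
    out = h @ W2 + b2                     # linear decoder: reconstruct the signal
    err = out - clean                     # error is measured against the CLEAN signal
    losses.append(float(np.mean(err ** 2)))
    # Plain gradient descent on the mean-squared reconstruction error.
    g_out = 2.0 * err / (n_samples * n_features)
    gW2, gb2 = h.T @ g_out, g_out.sum(axis=0)
    g_h = (g_out @ W2.T) * (1.0 - h ** 2)
    gW1, gb1 = noisy.T @ g_h, g_h.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(f"reconstruction MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In the paper's framework, the dense encoder/decoder above would be replaced by CNN layers (local spectral features) and BiLSTM layers with attention (temporal context), but the training objective, reconstructing clean signal from a corrupted input, is the same.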
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Coverage spans neurocomputing theory, practice, and applications.