Title: Self-supervised random forests for robust voice activity detection with limited labeled data
Authors: Manjiri Bhat, R.B. Keskar
Journal: Applied Acoustics, vol. 234, Article 110636, March 2025 (JCR Q1, Acoustics; impact factor 3.4)
DOI: 10.1016/j.apacoust.2025.110636
URL: https://www.sciencedirect.com/science/article/pii/S0003682X25001082
Self-supervised random forests for robust voice activity detection with limited labeled data
Voice activity detection is essential for many downstream speech applications. Existing deep learning models for voice activity detection and speech recognition typically require substantial annotated data and assume noise-free environments, which hinders their application to the vast but sparsely labeled audio datasets available. To address this gap, we propose a novel approach: self-supervised random forest voice activity detection (SSRF-VAD), designed for noisy environments and limited labeled data. We integrate a set of five handcrafted features to optimize performance under mixed signal-to-noise ratios (SNRs). The study incorporates diverse noise classes covering environmental sounds such as urban sounds, water sounds, indoor appliances, and animals. SSRF-VAD improves F1-score by 3% while using only 20% of the labeled training data, compared with the state-of-the-art MarbleNet model trained on the complete training set. Feature selection, implemented using two distinct feature importance techniques, SHAP and Gini, reduces the feature vector dimensionality by 75% while preserving accuracy. Further, a novel three-class classification separating clean speech, noisy speech, and non-speech audio segments achieves 98.74% accuracy with an F1-score of 0.982. This framework enhances speech analysis and noise characterization, contributing to efficient speech enhancement. By reducing the requirement for labeled data, the proposed SSRF-VAD method can be implemented on resource-constrained devices such as smart hearing aids and smart home assistants.
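The abstract's pipeline of a random forest classifier trained on few labeled frames, followed by Gini-importance-based feature selection, can be sketched with scikit-learn. This is an illustrative sketch only: the paper's five handcrafted acoustic features are not specified here, so synthetic frame-level features stand in for them, and the 20% labeled fraction and 75% dimensionality reduction are mirrored with made-up numbers.

```python
# Hypothetical sketch of the described approach (not the authors' code):
# train a random forest on a small labeled subset of frames, then rank
# features by Gini importance and keep only the top 25%.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for frame-level features: 400 frames x 8 features.
# Only the first two features actually carry the speech/non-speech signal.
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = speech, 0 = non-speech

# Emulate the limited-label setting: fit on only 20% of the frames.
n_labeled = 80
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:n_labeled], y[:n_labeled])

# Gini-based feature importance; keep the top 25% of features (2 of 8),
# mirroring the reported 75% reduction in feature vector dimensionality.
importances = clf.feature_importances_
keep = np.argsort(importances)[::-1][:2]
print("kept features:", sorted(keep.tolist()))
```

Because the labels here are a deterministic function of the first two features, Gini importance concentrates on them, and the remaining six noise features are discarded without hurting accuracy; the paper applies the same idea (alongside SHAP) to its handcrafted acoustic features.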
Journal introduction:
Since its launch in 1968, Applied Acoustics has been publishing high-quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication, thereby creating a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a mathematical development is a necessary part of a paper, it should appear only as an integral part of a practical solution to a problem and be supported by data. Applied Acoustics encourages the exchange of practical experience through:
• Complete Papers
• Short Technical Notes
• Review Articles
and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.