Can Layer-Wise SSL Features Improve Zero-Shot ASR Performance for Children’s Speech?

IF 3.9 · Zone 2 (Engineering Technology) · JCR Q2 · ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Signal Processing Letters Pub Date : 2025-08-25 DOI:10.1109/LSP.2025.3602636

Abhijit Sinha;Hemant Kumar Kathania;Sudarsana Reddy Kadiri;Shrikanth Narayanan

IEEE Signal Processing Letters, vol. 32, pp. 3759-3763, 2025. https://ieeexplore.ieee.org/document/11141362/


Abstract

Automatic Speech Recognition (ASR) systems often struggle to accurately process children’s speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children’s speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models (specifically Wav2Vec2, HuBERT, Data2Vec, and WavLM) in improving ASR performance for children’s speech in zero-shot scenarios. A detailed analysis of the features extracted from these models was conducted by integrating them into a simplified DNN-based ASR system built with the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children’s speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children’s speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
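The relative improvement quoted in the abstract follows from the two reported WER values. The sketch below is an illustration only, not the authors' scoring pipeline (the paper uses Kaldi): it assumes the standard definition of WER as word-level edit distance divided by reference length, and the function names `word_error_rate` and `relative_improvement` are hypothetical helpers introduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words.

    Assumes whitespace tokenization; real scoring tools also normalize text.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # WER values taken from the abstract: direct zero-shot decoding vs. Layer 22.
    print(f"{relative_improvement(10.65, 5.15):.2f}%")
```

With the abstract's figures, (10.65 - 5.15) / 10.65 gives the reported 51.64% relative improvement.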


Source journal

IEEE Signal Processing Letters (Engineering Technology: Engineering, Electrical & Electronic)

CiteScore: 7.40
Self-citation rate: 12.80%
Publications: 339
Review time: 2.8 months

About the journal: The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.