{"title":"Exploring knowledge distillation for low-resource multi-modal streaming ASR in the CHiME-8 MMCSG challenge","authors":"Hongbo Lan, Ya Jiang, Jun Du, Qing Wang","doi":"10.1016/j.csl.2025.101837","DOIUrl":null,"url":null,"abstract":"<div><div>In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework where a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean square error losses on hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit and audio data by developing a novel multi-modal streaming ASR model. Meanwhile, cross-modal distillation is performed by adopting the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that our proposed multi-modal fusion and teacher-student learning framework effectively enhance the performance of streaming ASR models. Notably, our approach secured the first place in the sub-track of the CHiME-8 MMCSG challenge.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101837"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000622","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
In the CHiME-8 Multi-modal Conversational Speech Recognition for Smart Glasses (MMCSG) challenge, participants were tasked with achieving real-time transcription of two-person conversations recorded with smart glasses. To address the scarcity of real-world data, we propose a knowledge distillation framework in which a non-streaming teacher model, trained on augmented multi-channel audio, guides a streaming student model. Leveraging simulated data with varying overlap rates, the framework employs a logit-based Kullback–Leibler divergence loss alongside mean squared error losses on the hidden states and attention maps of Fast-Conformer layers to transfer knowledge from the teacher to the student, significantly improving the performance of the audio-only streaming automatic speech recognition (ASR) model. Furthermore, we exploit the synergy and complementarity of inertial measurement unit (IMU) and audio data by developing a novel multi-modal streaming ASR model. Cross-modal distillation is then performed by using the non-streaming audio-only teacher to guide the streaming multi-modal student. Experimental results demonstrate that the proposed multi-modal fusion and teacher–student learning framework effectively enhances the performance of streaming ASR models. Notably, our approach secured first place in the sub-track of the CHiME-8 MMCSG challenge.
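To make the distillation objective described above concrete, here is a minimal PyTorch-style sketch (not the authors' implementation): it combines a temperature-scaled KL divergence on output logits with mean squared error terms on paired hidden states and attention maps, as the abstract outlines. The function name, loss weights, and temperature are hypothetical, and teacher and student Fast-Conformer layers are assumed to be paired one-to-one.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hiddens, teacher_hiddens,
                      student_attns, teacher_attns,
                      temperature=2.0, w_kl=1.0, w_hid=1.0, w_attn=1.0):
    """Hypothetical combined distillation loss: temperature-scaled KL on
    logits plus MSE on hidden states and attention maps of paired layers."""
    # KL divergence between teacher and student output distributions
    # (the teacher is the target, so its gradient is detached).
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # MSE on intermediate hidden states of paired layers.
    hid = sum(F.mse_loss(s, t.detach())
              for s, t in zip(student_hiddens, teacher_hiddens))

    # MSE on self-attention maps of the same paired layers.
    attn = sum(F.mse_loss(s, t.detach())
               for s, t in zip(student_attns, teacher_attns))

    return w_kl * kl + w_hid * hid + w_attn * attn
```

In practice the relative weights and temperature would be tuned on the development set, and the per-layer terms can be averaged rather than summed; these choices are not specified in the abstract.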
Journal introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.