Factorized and progressive knowledge distillation for CTC-based ASR models

Sanli Tian, Zehan Li, Zhaobiao Lyv, Gaofeng Cheng, Qing Xiao, Ta Li, Qingwei Zhao

Speech Communication, Volume 160, Article 103071, May 2024. DOI: 10.1016/j.specom.2024.103071
Citations: 0
Abstract
Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) based ASR models is challenging due to their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model's learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages that help the student model gradually build up its knowledge. The first stage is a simple binary classification KD task in which the student learns to distinguish between non-blank and blank frames, since the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves a 22.5% relative CER reduction on the Aishell-1 dataset, a 23.0% relative WER reduction on the Tedlium-2 dataset, and a 17.6% relative WER reduction on the LibriSpeech dataset. To demonstrate the generality of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as in scenarios with cross-model topology KD.
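The abstract does not spell out the exact FKL formulation, but the core idea it describes, splitting the teacher's frames into blank and non-blank groups and distilling each group in a balanced way, can be sketched roughly as follows. This is a hypothetical PyTorch-style illustration, not the authors' implementation: the function name, the use of the teacher's argmax to identify blank frames, and the equal weighting of the two groups are assumptions made for clarity.

```python
# Minimal sketch (assumed, not the paper's code): factorize frame-level KD
# into blank and non-blank groups so that the scarce non-blank frames are
# not swamped by the dominant blank frames during distillation.
import torch
import torch.nn.functional as F

def factorized_kd_loss(student_logits, teacher_logits, blank_id=0, temperature=1.0):
    """student_logits, teacher_logits: (T, V) frame logits for one utterance.

    Frames are split by the teacher's argmax (blank vs. non-blank) and the
    KL-divergence is averaged within each group separately before the two
    terms are combined with equal weight.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # Per-frame KL(teacher || student), summed over the vocabulary dimension.
    frame_kl = F.kl_div(student_logp, teacher_probs, reduction="none").sum(-1)

    is_blank = teacher_probs.argmax(dim=-1) == blank_id
    group_losses = []
    for mask in (is_blank, ~is_blank):
        if mask.any():
            group_losses.append(frame_kl[mask].mean())

    # Equal weighting of the blank and non-blank terms rebalances learning.
    return torch.stack(group_losses).mean()
```

In the paper's third stage, FKL additionally applies different operations to the blank-frame and non-blank-frame posteriors (e.g., to suppress the irregular non-blank probabilities inside blank frames); the sketch above only illustrates the frame factorization and rebalancing described in the abstract.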
Journal introduction:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.