基于知识精馏的小足迹声学模型高效构建策略

2018 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2018-12-01 DOI:10.1109/SLT.2018.8639545

Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono

{"title":"基于知识精馏的小足迹声学模型高效构建策略","authors":"Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono","doi":"10.1109/SLT.2018.8639545","DOIUrl":null,"url":null,"abstract":"In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.","PeriodicalId":377307,"journal":{"name":"2018 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models\",\"authors\":\"Takafumi Moriya, Hiroki Kanagawa, Kiyoaki Matsui, Takaaki Fukutomi, Yusuke Shinohara, Y. Yamaguchi, M. Okamoto, Y. Aono\",\"doi\":\"10.1109/SLT.2018.8639545\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.\",\"PeriodicalId\":377307,\"journal\":{\"name\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE Spoken Language Technology Workshop (SLT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SLT.2018.8639545\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT.2018.8639545","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

本文提出了一种基于深度神经网络(DNN)的小足迹声学模型的训练策略。通过利用大量的数据来提高表达水平，可以大大提高基于dnn的自动语音识别(ASR)系统的准确性。dnn使用许多参数来提高识别性能。不幸的是，资源有限的本地设备无法运行复杂的基于dnn的ASR系统。为了建立紧凑的声学模型，通常使用知识蒸馏(KD)方法。KD使用一个大的，训练良好的模型，输出目标标签来训练一个紧凑的模型。然而，标准KD不能充分利用大模型输出来训练紧凑模型，因为软逻辑只提供粗略的信息。我们假定大模型一定会给小模型提供更多有用的提示。我们提出了一种先进的KD，它使用均方误差来最小化最终隐藏层输出之间的差异。我们在假设汽车和家庭使用场景的录音语音数据集上评估了我们的建议，并表明我们的模型比传统的KD方法或在计算资源受限的设备上从头开始训练实现了更低的字符错误率。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient Building Strategy with Knowledge Distillation for Small-Footprint Acoustic Models

In this paper, we propose a novel training strategy for deep neural network (DNN) based small-footprint acoustic models. The accuracy of DNN-based automatic speech recognition (ASR) systems can be greatly improved by leveraging large amounts of data to improve the level of expression. DNNs use many parameters to enhance recognition performance. Unfortunately, resource-constrained local devices are unable to run complex DNN-based ASR systems. For building compact acoustic models, the knowledge distillation (KD) approach is often used. KD uses a large, well-trained model that outputs target labels to train a compact model. However, the standard KD cannot fully utilize the large model outputs to train compact models because the soft logits provide only rough information. We assume that the large model must give more useful hints to the compact model. We propose an advanced KD that uses mean squared error to minimize the discrepancies between the final hidden layer outputs. We evaluate our proposal on recorded speech data sets assuming car-and home-use scenarios, and show that our models achieve lower character error rates than the conventional KD approach or from-scratch training on computation resource-constrained devices.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE Spoken Language Technology Workshop (SLT)

自引率

0.00%

发文量