PLDE: A lightweight pooling layer for spoken language recognition

IF 2.4 · Zone 3 (Computer Science) · JCR Q2, ACOUSTICS
Zimu Li, Yanyan Xu, Dengfeng Ke, Kaile Su
Speech Communication, Volume 158, Article 103055. Journal article, published 2024-02-23.
DOI: 10.1016/j.specom.2024.103055
https://www.sciencedirect.com/science/article/pii/S016763932400027X
Citations: 0

Abstract

In recent years, transfer learning methods that replace acoustic features with phonetic features have become a new paradigm for end-to-end spoken language recognition. However, these large transfer learning models always encode too much redundant information. In this paper, we propose a lightweight language recognition decoder based on a phonetic learnable dictionary encoding (PLDE) layer, which is better suited to phonetic features and achieves better recognition performance while significantly reducing the number of parameters. The lightweight decoder consists of three main parts: (1) a phonetic learnable dictionary with ghost clusters, which improves on the traditional LDE pooling layer and uses the ghost clusters to strengthen the model's ability to model noise; (2) coarse-grained chunk-level pooling, which highlights the phone sequence and suppresses noise around the ghost clusters, thereby reducing their influence on the subsequent network; (3) fine-grained chunk-level projection, which enables the discriminative network to obtain more linguistic information and thereby improves the model's modelling ability. Together, these three parts simplify the language recognition decoder into a PLDE pooling layer, reducing the decoder's parameter count by at least one order of magnitude while achieving better recognition performance. In experiments on the OLR2020 dataset, the proposed method outperforms the current state-of-the-art language recognition system in Cavg, achieving improvements of 24.68% and 42.24% on the cross-channel and unknown-noise test sets, respectively. Experimental results on the OLR2021 dataset further demonstrate the effectiveness of PLDE.
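The abstract describes ghost clusters as an extension of the traditional LDE pooling layer. As a rough illustration of that general idea only (this is not the authors' PLDE layer; its phonetic dictionary, chunk-level pooling and chunk-level projection are not specified here), the sketch below combines standard LDE pooling, i.e. a soft assignment of each frame to codewords using per-codeword smoothing factors followed by weight-normalised residual averaging, with GhostVLAD-style ghost clusters, which take part in the softmax but are excluded from the output. The function name `lde_pool_with_ghosts`, its arguments, and the plain-Python list representation are hypothetical choices for illustration; in a real model the codewords and smoothing factors would be trained parameters.

```python
import math
from typing import List

Vector = List[float]


def lde_pool_with_ghosts(
    frames: List[Vector],
    codewords: List[Vector],
    smoothing: List[float],
    num_ghost: int,
) -> List[float]:
    """Hypothetical sketch: LDE pooling with ghost clusters (not the PLDE implementation).

    The last `num_ghost` codewords act as ghost clusters: they compete in the
    soft assignment, but their residuals are dropped from the output.
    """
    num_clusters = len(codewords)
    num_real = num_clusters - num_ghost
    dim = len(codewords[0])
    # Accumulators for the real (non-ghost) clusters only.
    weighted_residuals = [[0.0] * dim for _ in range(num_real)]
    weight_sums = [0.0] * num_real

    for x in frames:
        # Smoothing-scaled squared distances to every codeword, real and ghost alike.
        logits = [
            -smoothing[c] * sum((xi - mi) ** 2 for xi, mi in zip(x, codewords[c]))
            for c in range(num_clusters)
        ]
        # Numerically stable softmax over all real + ghost clusters.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        # Ghost clusters absorb part of the weight but are never accumulated.
        for c in range(num_real):
            w = exps[c] / total
            weight_sums[c] += w
            for d in range(dim):
                weighted_residuals[c][d] += w * (x[d] - codewords[c][d])

    # Normalise each cluster's residual by its total assignment weight and
    # concatenate into a fixed-length vector of size num_real * dim.
    eps = 1e-12
    return [
        weighted_residuals[c][d] / (weight_sums[c] + eps)
        for c in range(num_real)
        for d in range(dim)
    ]
```

For example, `lde_pool_with_ghosts([[0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]], [1.0, 1.0, 1.0], num_ghost=1)` returns approximately `[0, 0, -1, -1]`: the first real cluster's residual is zero, the second is the frame minus its codeword, and the ghost cluster's residual never appears in the utterance-level vector. The output length depends only on the number of real clusters and the feature dimension, not on the number of frames.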


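Source journal: Speech Communication
Subject category: Engineering & Technology / Computer Science, Interdisciplinary Applications
CiteScore: 6.80
Self-citation rate: 6.20%
Articles published: 94
Review time: 19.2 weeks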
Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.