Multi-Quartznet: Multi-Resolution Convolution for Speech Recognition with Multi-Layer Feature Fusion

2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2020-11-26 DOI: 10.1109/SLT48900.2021.9383532

Jian Luo, Jianzong Wang, Ning Cheng, Guilin Jiang, Jing Xiao

Citations: 8

Abstract

In this paper, we propose an end-to-end speech recognition network based on Nvidia's earlier QuartzNet [1] model. To improve model performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, where each stream uses a unique dilated stride in its convolutional operations; (2) a Channel-Wise Attention Module, which calculates the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments demonstrate that the Multi-QuartzNet model achieves a CER of 6.77% on the AISHELL-1 data set, which outperforms the original QuartzNet and is close to the state-of-the-art result.
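To make components (1) and (2) concrete, here is a minimal pure-Python sketch of a multi-stream dilated depthwise convolution whose streams are fused by attention weights. It is not the authors' implementation. The function names, the "same" zero padding, the use of one scalar score per stream (the mean over channels and time), and the softmax over streams are all assumptions chosen for illustration. The abstract only says that each stream has its own dilation and that stream weights come from channel-wise pooling.

```python
import math


def dilated_depthwise_conv1d(x, kernel, dilation):
    """Depthwise 1D convolution with zero padding so output length equals input length.

    x:      list of C channels, each a list of T floats.
    kernel: list of C channels, each a list of K taps (K odd).
    Hypothetical helper; the paper's layers also include a pointwise convolution.
    """
    half = (len(kernel[0]) - 1) // 2
    out = []
    for chan, taps in zip(x, kernel):
        t_len = len(chan)
        row = []
        for t in range(t_len):
            acc = 0.0
            for j, w in enumerate(taps):
                idx = t + (j - half) * dilation  # taps are spaced `dilation` steps apart
                if 0 <= idx < t_len:             # positions outside the signal count as zero
                    acc += w * chan[idx]
            row.append(acc)
        out.append(row)
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_resolution_block(x, kernels, dilations):
    """Run one dilated stream per (kernel, dilation) pair and fuse them by attention.

    Assumption: each stream gets a single score, its mean activation over
    channels and time, and the scores are normalised with a softmax.
    """
    streams = [dilated_depthwise_conv1d(x, k, d) for k, d in zip(kernels, dilations)]
    scores = [sum(sum(row) for row in s) / (len(s) * len(s[0])) for s in streams]
    weights = softmax(scores)
    fused = [
        [sum(w * s[c][t] for w, s in zip(weights, streams)) for t in range(len(x[0]))]
        for c in range(len(x))
    ]
    return fused, weights


if __name__ == "__main__":
    features = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.0, -0.5, 1.0, 2.0]]
    smoothing = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]
    fused, weights = multi_resolution_block(features, [smoothing] * 3, [1, 2, 3])
    print("stream weights:", weights)
    print("fused features:", fused)
```

Sharing the same taps across streams in the demo is only for brevity; in a trained network each stream would learn its own kernel, and the larger dilations would let a stream cover a wider temporal context with the same number of taps.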


