Continuous topic language modeling for speech recognition

2008 IEEE Spoken Language Technology Workshop. Pub Date: 2008-12-01. DOI: 10.1109/SLT.2008.4777873

C. Chueh, Jen-Tzung Chien

Citations: 1

Abstract

Continuous representations of word sequences can effectively address the data sparseness problem in n-gram language models, where words are represented as discrete variables and unseen events are likely to occur. The problem becomes increasingly severe when extracting long-distance regularities with high-order n-gram models. Rather than working in the discrete word space, we construct a continuous space of word sequences from which latent topic information is extracted. The continuous vector is formed from the topic posterior probabilities, and a least-squares projection matrix from the discrete word space to the continuous topic space is estimated accordingly. Unseen words can then be predicted through the new continuous latent topic language model. In experiments on continuous speech recognition, we obtain a significant performance improvement over the conventional topic-based language model.
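As a rough illustration of the least-squares projection step mentioned in the abstract, the following minimal sketch fits a matrix that maps discrete word-count vectors to topic posterior vectors. It is not the authors' implementation: the function names `fit_projection` and `project`, the toy dimensions, the random count vectors, and the Dirichlet-sampled stand-in posteriors are all assumptions made for the example, and the abstract does not specify how the posteriors or the projected vectors are actually computed or normalized.

```python
import numpy as np

def fit_projection(X, T):
    """Return W minimizing ||X @ W - T||_F^2 (ordinary least squares).

    X: (N, V) discrete word-space vectors (here: toy word counts).
    T: (N, K) topic posterior vectors for the same N histories.
    """
    W, *_ = np.linalg.lstsq(X, T, rcond=None)
    return W

def project(x, W):
    """Map one word-space vector into topic space.

    The clip-and-renormalize step is an assumption of this sketch so that
    the result can be read as a distribution over topics.
    """
    t = np.clip(x @ W, 1e-12, None)
    return t / t.sum()

rng = np.random.default_rng(0)
V, K, N = 6, 2, 20  # toy vocabulary size, topic count, training histories

# Hypothetical training data: random counts and stand-in topic posteriors.
X = rng.integers(0, 3, size=(N, V)).astype(float)
T = rng.dirichlet(np.ones(K), size=N)

W = fit_projection(X, T)

# Project a history that need not appear in the training set.
t_new = project(np.array([1.0, 0.0, 2.0, 0.0, 0.0, 1.0]), W)
```

In a real system the posteriors in `T` would come from a trained topic model rather than random sampling, and `t_new` would feed a word predictor defined over the continuous topic space.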
