Learning a subword vocabulary based on unigram likelihood

2013 IEEE Workshop on Automatic Speech Recognition and Understanding. Pub Date: 2013-12-01. DOI: 10.1109/ASRU.2013.6707697

Matti Varjokallio, M. Kurimo, Sami Virpioja

Citations: 19

Abstract

Using words as vocabulary units for tasks like speech recognition is infeasible for many morphologically rich languages, including Finnish. Subword units are therefore commonly used for language modeling. This work presents a novel algorithm for creating a subword vocabulary based on the unigram likelihood of a text corpus. The method is evaluated with an entropy measure and on a Finnish large-vocabulary continuous speech recognition (LVCSR) task. The unigram entropy of the text corpus is shown to be a good indicator of the quality of higher-order n-gram models, and the method also yields high speech recognition accuracy.
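The abstract does not spell out the vocabulary-learning algorithm itself, so the following is only a minimal sketch of the quantity it builds on: the unigram log-likelihood of a corpus under a fixed subword vocabulary. It assumes each word is scored by its single best (Viterbi) segmentation; the function names, the toy vocabulary, and its probabilities are hypothetical and not taken from the paper.

```python
import math


def viterbi_segment(word, logp):
    """Find the most probable split of `word` into units from `logp`.

    `logp` maps each subword unit to its unigram log probability.
    Returns (units, log_prob), or (None, -inf) if the word cannot be
    built from the vocabulary at all.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            unit = word[start:end]
            if unit in logp and best[start] + logp[unit] > best[end]:
                best[end] = best[start] + logp[unit]
                back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf
    # Follow the back-pointers from the end of the word to recover units.
    units, i = [], n
    while i > 0:
        units.append(word[back[i]:i])
        i = back[i]
    return units[::-1], best[n]


def corpus_log_likelihood(words, logp):
    """Sum the best-path log probabilities over all word tokens."""
    return sum(viterbi_segment(w, logp)[1] for w in words)


# Toy vocabulary (hypothetical probabilities that sum to 1).
probs = {"talo": 0.2, "ssa": 0.2, "s": 0.2,
         "t": 0.1, "a": 0.1, "l": 0.1, "o": 0.1}
logp = {u: math.log(p) for u, p in probs.items()}

units, word_ll = viterbi_segment("talossa", logp)  # units == ['talo', 'ssa']
total_ll = corpus_log_likelihood(["talossa", "talo"], logp)
```

A vocabulary-selection procedure could compare candidate vocabularies by this likelihood (or by the corresponding entropy, the negative log-likelihood normalized by corpus size), but how the paper searches over vocabularies is not stated in the abstract.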
