Low-resource speech translation of Urdu to English using semi-supervised part-of-speech tagging and transliteration

2008 IEEE Spoken Language Technology Workshop. Pub Date: 2008-12-01. DOI: 10.1109/SLT.2008.4777891

A. Aminzadeh, Wade Shen

Citations: 3

Abstract

This paper describes the construction of automatic speech recognition (ASR) and machine translation (MT) systems for translating speech from Urdu into English. As both Urdu pronunciation lexicons and Urdu-English bitexts are sparse, we employ several techniques that use semi-supervised annotation to improve ASR and MT training. Specifically, we describe 1) the construction of a semi-supervised HMM-based part-of-speech tagger that is used to train factored translation models and 2) the use of an HMM-based transliterator from which we derive a spelling-to-pronunciation model for Urdu used in ASR training. We describe experiments performed for both ASR and MT training in the context of the Urdu-to-English task of the NIST MT08 Evaluation, and we compare methods that use the additional annotation with standard statistical MT and ASR baselines.
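
To make the HMM-based tagging idea concrete, the following is a minimal sketch of Viterbi decoding for a first-order HMM part-of-speech tagger. It is not the authors' implementation: the abstract does not specify their model details, and their tagger is trained in a semi-supervised way, whereas every probability below is hand-set for illustration. The function name `viterbi`, the tag set, the toy English vocabulary, and the `floor` smoothing constant are all assumptions.

```python
import math

# Hypothetical toy model: all tags, words, and probabilities are illustrative only.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.05, "NOUN": 0.15, "VERB": 0.8},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.6},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability with a small floor so unseen events do not give log(0).
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-prob of the best path ending in tag t at position i, previous tag)
    best = [{
        t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Pick the best predecessor tag for t.
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


print(viterbi("the dog barks".split(), TAGS, START, TRANS, EMIT))
```

In a factored translation setup such as the one the abstract mentions, the predicted tag for each word would be attached to that word as an additional factor. How the authors estimate the HMM parameters from partially annotated data is not described here and is not reproduced by this sketch.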


