Unwritten languages demand attention too! Word discovery with encoder-decoder models

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2017-09-17. DOI: 10.1109/ASRU.2017.8268972

Marcely Zanon Boito, Alexandre Berard, Aline Villavicencio, L. Besacier

Citations: 22

Abstract

Word discovery is the task of extracting words from unsegmented text. In this paper we examine to what extent neural networks can be applied to this task in a realistic unwritten language scenario, where only small corpora and limited annotations are available. We investigate two scenarios: one with no supervision, and another with limited supervision in which the most frequent words are known. Our results show that it is possible to retrieve at least 27% of the gold standard vocabulary by training an encoder-decoder neural machine translation system on only 5,157 sentences. This result is close to those obtained with a task-specific Bayesian nonparametric model. Moreover, our approach has the advantage of generating translation alignments, which could be used to create a bilingual lexicon. As a future perspective, this approach is also well suited to working directly from speech.
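
The abstract does not spell out how translation alignments become a segmentation, so the following is only a minimal sketch under stated assumptions. It assumes each unsegmented source symbol is assigned to the target word with the highest alignment weight, and that a word boundary falls wherever that target word changes. It also shows one plausible reading of "retrieving a share of the gold standard vocabulary" as recall over word types. The names `segment_by_alignment` and `vocabulary_recall`, and the toy alignment matrix, are hypothetical and not taken from the paper.

```python
from typing import Iterable, List, Sequence


def segment_by_alignment(
    symbols: Sequence[str],
    alignment: Sequence[Sequence[float]],
) -> List[str]:
    """Group consecutive symbols that align to the same target word.

    alignment[i][j] is the (assumed) soft-alignment weight between
    source symbol i and target word j, e.g. an attention weight.
    """
    if len(symbols) != len(alignment):
        raise ValueError("one alignment row is required per source symbol")

    words: List[str] = []
    current = ""
    previous_target = None
    for symbol, weights in zip(symbols, alignment):
        # Hard-assign the symbol to its most strongly aligned target word.
        target = max(range(len(weights)), key=weights.__getitem__)
        # A change of aligned target word is treated as a word boundary.
        if previous_target is not None and target != previous_target:
            words.append(current)
            current = ""
        current += symbol
        previous_target = target
    if current:
        words.append(current)
    return words


def vocabulary_recall(discovered: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold-standard word types found among discovered types."""
    gold_types = set(gold)
    if not gold_types:
        return 0.0
    return len(set(discovered) & gold_types) / len(gold_types)


# Toy example: four source symbols, two target words (hypothetical weights).
toy_alignment = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.1, 0.9],
]
toy_words = segment_by_alignment(list("abcd"), toy_alignment)
```

Under these assumptions, the toy input is split into two units, and `vocabulary_recall` can then be applied to the discovered units across a corpus to estimate how much of a reference vocabulary was recovered.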
