Learning Distributed Representations for Multilingual Text Sequences

VS@HLT-NAACL | Pub Date: 2015-06-01 | DOI: 10.3115/v1/W15-1512

Hieu Pham, Thang Luong, Christopher D. Manning

Citations: 61

Abstract

We propose a novel approach to learning distributed representations of variable-length text sequences in multiple languages simultaneously. Unlike previous work, which often derives representations of multi-word sequences as weighted sums of individual word vectors, our model learns distributed representations for phrases and sentences as a whole. Our work is similar in spirit to the recent paragraph vector approach but extends it to the bilingual context so as to efficiently encode meaning-equivalent text sequences of multiple languages in the same semantic space. Our learned embeddings achieve state-of-the-art performance on the widely used cross-lingual document classification (CLDC) task, with an accuracy of 92.7% for English to German and 91.5% for German to English. By learning text sequence representations as a whole, our model performs equally well in both classification directions of the CLDC task, which past work did not achieve.
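
The contrast drawn in the abstract, between composing a sequence vector from word vectors and learning one vector per sequence, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy vocabularies, the dimension, the random initialisation, and the names `weighted_sum`, `pair_vecs`, and `sequence_vector` are all assumptions made for illustration, and no training step is shown. The sketch also assumes that a pair of meaning-equivalent sentences shares a single vector, which is one way to place both languages in the same space.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # hypothetical embedding size

# Hypothetical toy vocabularies for one parallel English/German sentence pair.
en_vocab = {"the": 0, "dog": 1, "barks": 2}
de_vocab = {"der": 0, "hund": 1, "bellt": 2}
en_word_vecs = rng.normal(size=(len(en_vocab), dim))
de_word_vecs = rng.normal(size=(len(de_vocab), dim))


def weighted_sum(tokens, vocab, word_vecs, weights=None):
    """Compositional baseline: sequence vector = weighted sum of word vectors."""
    ids = [vocab[t] for t in tokens]
    if weights is None:
        # Uniform weights reduce the weighted sum to a plain average.
        weights = np.full(len(ids), 1.0 / len(ids))
    return weights @ word_vecs[ids]


# Whole-sequence alternative (assumed setup): every parallel pair owns one
# vector of its own, looked up by pair id rather than built from word vectors.
# In a real model these rows would be trainable parameters.
pair_vecs = rng.normal(size=(1, dim))


def sequence_vector(pair_id):
    """Return the vector shared by both sides of a parallel sentence pair."""
    return pair_vecs[pair_id]


en_sum = weighted_sum(["the", "dog", "barks"], en_vocab, en_word_vecs)
de_sum = weighted_sum(["der", "hund", "bellt"], de_vocab, de_word_vecs)
shared = sequence_vector(0)
```

With the weighted-sum baseline, `en_sum` and `de_sum` are two different vectors that agree only if the word embeddings of the two languages happen to be aligned. In the whole-sequence setup the English and German sentences map to the same `shared` vector by construction.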
