English-Chinese Cross Language Word Embedding Similarity Calculation

Artificial Intelligence and Cloud Computing Conference Pub Date : 2018-12-21 DOI:10.1145/3299819.3299831

Like Wang, Yuan Sun, Xiaobing Zhao

Citations: 0

Abstract

Language differences among countries, regions, and nationalities create major obstacles to communication. Cross-language word similarity (CLWS) calculation is the most practical way to address this problem. The choice of corpus is one of the factors that influence the calculated result. This paper compares the similarity of word embeddings obtained from bilingual parallel and non-parallel corpora using traditional models. First, it uses the fastText method to compute monolingual word embeddings for Chinese and English, and computes the semantic similarity between the two sets of embeddings. It then maps the word embeddings into an implicit shared space using Multilingual Unsupervised and Supervised Embedding (MUSE), and compares the effect of unsupervised and supervised machine learning methods on parallel and non-parallel corpora. Finally, the experimental results show that the MUSE model aligns the monolingual word embedding spaces better, and that non-parallel corpora have the same effect as parallel corpora in calculating the CLWS.
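
The mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the vectors below are random toy data standing in for fastText embeddings, the "Chinese" space is constructed as a rotated copy of the "English" one, and the alignment is a single orthogonal Procrustes solve, the kind of closed-form step that supervised alignment methods such as MUSE's build on. The function names `procrustes_map` and `cosine` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def procrustes_map(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays whose rows are translation pairs
    from a seed dictionary (here: toy data, not real embeddings).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim, n_words = 4, 6

# Toy "English" embeddings (assumption: stand-ins for fastText vectors).
en = rng.normal(size=(n_words, dim))

# Toy "Chinese" embeddings: the same points under an unknown rotation,
# mimicking two monolingual spaces that differ by an orthogonal map.
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
zh = en @ q

# Learn the mapping from the first five pairs; hold out the last pair.
w = procrustes_map(zh[:5], en[:5])

before = cosine(zh[5], en[5])     # similarity in unaligned spaces
after = cosine(zh[5] @ w, en[5])  # similarity after mapping zh into en space
print(f"held-out pair: before={before:.3f}, after={after:.3f}")
```

Because the toy spaces differ by an exact rotation, the held-out pair becomes almost perfectly similar after mapping. Real monolingual embeddings are only approximately related by such a map, so similarities on real data would be lower.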

