A Study of Efficacy of Cross-lingual Word Embeddings for Indian Languages

Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. Pub Date: 2020-01-05. DOI: 10.1145/3371158.3371219

Jyotsana Khatri, V. Rudra Murthy, P. Bhattacharyya

Citations: 4

Abstract

Cross-lingual word embeddings have become ubiquitous for various NLP tasks. Existing literature primarily evaluates the quality of cross-lingual word embeddings on the task of Bilingual Lexicon Induction (BLI) and reports very high accuracies for European languages. In this paper, we report BLI accuracy for Indian languages using cross-lingual word embeddings generated by two mapping-based unsupervised approaches, VecMap and MUSE, on a dataset created using the linked Indian Wordnet. We also compare these approaches with a simple baseline in which the embeddings for all languages are trained using fastText on the combined corpora of 11 Indian languages. Our experiments show that existing cross-lingual word embedding approaches give low accuracy on bilingual lexicon induction for cognate words. Given the high cognate overlap among several Indian languages, this is a serious limitation of existing approaches.
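To make the evaluation metric concrete, the following is a minimal sketch of how BLI accuracy (precision at 1) can be computed once source and target embeddings live in a shared space. It is not the authors' evaluation code. The function name `bli_precision_at_1`, the dictionary-based inputs, and the use of plain cosine nearest-neighbour retrieval are all assumptions made for illustration; the abstract does not state which retrieval criterion was used, and the toy vectors are placeholders rather than real data.

```python
import numpy as np


def bli_precision_at_1(src_emb, tgt_emb, gold_pairs):
    """Share of source words whose nearest target word is a gold translation.

    src_emb, tgt_emb: dict mapping word -> 1-D numpy vector (hypothetical
        inputs, assumed to be already mapped into one shared space).
    gold_pairs: dict mapping source word -> set of acceptable target words.
    """
    tgt_words = list(tgt_emb)
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_words])
    # Unit-normalise rows so a dot product equals cosine similarity.
    tgt_matrix = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

    correct = 0
    evaluated = 0
    for src_word, translations in gold_pairs.items():
        if src_word not in src_emb:
            continue  # skip dictionary entries without a source embedding
        query = src_emb[src_word]
        query = query / np.linalg.norm(query)
        # Retrieve the single most similar target word.
        best = tgt_words[int(np.argmax(tgt_matrix @ query))]
        correct += int(best in translations)
        evaluated += 1
    return correct / evaluated if evaluated else 0.0


# Toy example with placeholder words and 2-D vectors.
src = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
tgt = {
    "x": np.array([0.9, 0.1]),
    "y": np.array([0.1, 0.9]),
    "z": np.array([-1.0, 0.0]),
}
gold = {"a": {"x"}, "b": {"z"}}
result = bli_precision_at_1(src, tgt, gold)
print(result)  # 0.5: "a" retrieves "x" (correct), "b" retrieves "y" (wrong)
```

Under this metric, a cognate pair counts as correct only if the mapped source vector lands closest to its counterpart in the target vocabulary, so the low accuracy reported for cognate words would show up directly as a low value of this score.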
