A Study of Efficacy of Cross-lingual Word Embeddings for Indian Languages
Jyotsana Khatri, V. Rudra Murthy, P. Bhattacharyya
Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, 2020-01-05
DOI: https://doi.org/10.1145/3371158.3371219
Citations: 4
Abstract
Cross-lingual word embeddings have become ubiquitous in NLP tasks. The existing literature primarily evaluates the quality of cross-lingual word embeddings on the task of Bilingual Lexicon Induction (BLI) and reports very high accuracies for European languages. In this paper, we report BLI accuracy for cross-lingual word embeddings of Indian languages generated using two mapping-based unsupervised approaches, VecMap and MUSE, on a dataset created using linked Indian Wordnet. We also compare these approaches with a simple baseline in which the embeddings for all languages are trained using fastText on the combined corpora of 11 Indian languages. Our experiments show that existing cross-lingual word embedding approaches give low accuracy on bilingual lexicon induction for cognate words. Given the high cognate overlap among several Indian languages, this is a serious limitation of existing approaches.
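
To make the evaluation task concrete, the following is a minimal sketch of how BLI accuracy is commonly scored once source and target embeddings share one vector space. It assumes precision@1 with cosine nearest-neighbour retrieval; the abstract only says "accuracy", so the exact metric and retrieval criterion used in the paper may differ. The function names, the toy vectors, and the placeholder words are all hypothetical and are not taken from the paper's dataset.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bli_precision_at_1(
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
    gold: Dict[str, Set[str]],
) -> float:
    """Fraction of source words whose nearest target word is a gold translation.

    Assumption: src_emb and tgt_emb already live in a shared space, for
    example after a mapping step or after joint training on combined corpora.
    Source words without an embedding are skipped, not counted as errors.
    """
    evaluated = 0
    correct = 0
    for src_word, translations in gold.items():
        if src_word not in src_emb:
            continue
        evaluated += 1
        query = src_emb[src_word]
        # Retrieve the single most similar target word by cosine similarity.
        best = max(tgt_emb, key=lambda w: cosine(query, tgt_emb[w]))
        if best in translations:
            correct += 1
    return correct / evaluated if evaluated else 0.0


# Hypothetical toy data: three source words, three target words.
toy_src = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 0.5]}
toy_tgt = {"t1": [0.9, 0.1], "t2": [0.1, 0.9], "t3": [-1.0, 0.2]}
toy_gold = {"s1": {"t1"}, "s2": {"t2"}, "s3": {"t3"}}

toy_score = bli_precision_at_1(toy_src, toy_tgt, toy_gold)
print(f"precision@1 = {toy_score:.3f}")
```

In this toy example `s3` retrieves `t1` instead of its gold translation `t3`, so two of the three entries are counted as correct. Plain cosine retrieval is used here only for brevity; other retrieval criteria would change only the line that selects `best`.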