Word Translation using Cross-Lingual Word Embedding: Case of Sanskrit to Hindi Translation

Rashi Kumar, V. Sahula
{"title":"Word Translation using Cross-Lingual Word Embedding: Case of Sanskrit to Hindi Translation","authors":"Rashi Kumar, V. Sahula","doi":"10.1109/AISP53593.2022.9760564","DOIUrl":null,"url":null,"abstract":"Sanskrit is a low resource language for which large parallel data sets are not available. Large parallel data sets are required for Machine Translation. Cross-Lingual word embedding helps to learn the meaning of words across languages in a shared vector space. In the present work, we propose a translation technique between Sanskrit and Hindi words without a parallel corpus-base. Here, fastText pre-trained word embedding for Sanskrit and Hindi are used and are aligned in the same vector space using Singular Value Decomposition and a Quasi bilingual dictionary. A Quasi bilingual dictionary is generated from similar character string words in the monolingual word embeddings of both languages. Translations for the test dictionary are evaluated on the various retrieval methods e.g. Nearest neighbor, Inverted Sofmax approach, and Cross-domain Similarity Local Scaling, in order to address the issue of hubness that arises due to the high dimensional space of the vector embeddings. The results are compared with the other Unsupervised approaches at 1, 10, and 20 neighbors. While computing the Cosine similarity, we observed that the similarity between the expected and the translated target words is either close to unity or equal to unity for the cases that were even not included in the Quasi bilingual dictionary that was used to generate the orthogonal mapping. A test dictionary was developed from the Wikipedia Sanskrit-Hindi Shabdkosh to test the translation accuracy of the system. The proposed method is being extended for sentence translation.","PeriodicalId":6793,"journal":{"name":"2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP)","volume":"10 1","pages":"1-7"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/AISP53593.2022.9760564","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Sanskrit is a low-resource language for which large parallel data sets are not available, yet such data sets are required for machine translation. Cross-lingual word embeddings help to learn the meanings of words across languages in a shared vector space. In the present work, we propose a technique for translating words between Sanskrit and Hindi without a parallel corpus. FastText pre-trained word embeddings for Sanskrit and Hindi are aligned in the same vector space using Singular Value Decomposition and a quasi-bilingual dictionary, which is generated from words with similar character strings in the monolingual embeddings of the two languages. Translations of the test dictionary are evaluated with several retrieval methods, namely nearest neighbor, Inverted Softmax, and Cross-Domain Similarity Local Scaling (CSLS), in order to address the hubness problem that arises in the high-dimensional embedding space. The results are compared with other unsupervised approaches at 1, 10, and 20 neighbors. When computing cosine similarity, we observed that the similarity between the expected and the translated target words is close to or equal to unity, even for cases not included in the quasi-bilingual dictionary used to generate the orthogonal mapping. A test dictionary was built from the Wikipedia Sanskrit-Hindi Shabdkosh to evaluate the translation accuracy of the system. The proposed method is being extended to sentence translation.
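
The pipeline described in the abstract (a seed dictionary of shared character strings, an orthogonal mapping solved by SVD, and CSLS retrieval) can be sketched as follows. This is a minimal illustration rather than the authors' code: the NumPy implementation, the function names, and the use of exact string matches for the quasi-bilingual dictionary are assumptions, since the abstract only states that the dictionary comes from similar character strings and that the alignment uses Singular Value Decomposition.

    # Sketch of SVD-based embedding alignment and CSLS retrieval (assumptions noted above).
    # Assumes Sanskrit and Hindi fastText vectors are already loaded as {word: np.ndarray}
    # dictionaries, e.g. parsed from the pre-trained cc.sa.300.vec and cc.hi.300.vec files.
    import numpy as np

    def normalize(M):
        # Row-normalize so dot products become cosine similarities.
        return M / np.linalg.norm(M, axis=1, keepdims=True)

    def build_quasi_dictionary(src_vecs, tgt_vecs):
        # Seed pairs from words whose character strings occur in both vocabularies
        # (assumed proxy for the paper's "similar character string" dictionary).
        shared = sorted(set(src_vecs) & set(tgt_vecs))
        X = np.stack([src_vecs[w] for w in shared])   # source (Sanskrit) side
        Y = np.stack([tgt_vecs[w] for w in shared])   # target (Hindi) side
        return X, Y

    def procrustes(X, Y):
        # Orthogonal W minimizing ||XW - Y||_F, from the SVD of X^T Y.
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    def csls_scores(src_mapped, tgt_matrix, k=10):
        # Cross-Domain Similarity Local Scaling: penalize "hub" words by
        # subtracting each vector's mean similarity to its k nearest neighbors.
        sims = normalize(src_mapped) @ normalize(tgt_matrix).T
        r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # source -> k-NN targets
        r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # target -> k-NN sources
        return 2 * sims - r_src[:, None] - r_tgt[None, :]

    def translate(words, src_vecs, tgt_vecs, W, k=10, topn=1):
        # Return the top-n Hindi candidates for each Sanskrit query word.
        tgt_words = list(tgt_vecs)
        T = np.stack([tgt_vecs[w] for w in tgt_words])
        Q = np.stack([src_vecs[w] for w in words]) @ W
        best = np.argsort(-csls_scores(Q, T, k), axis=1)[:, :topn]
        return {w: [tgt_words[j] for j in row] for w, row in zip(words, best)}

For an evaluation at 1, 10, or 20 neighbors as mentioned above, topn would be set accordingly and the reference Hindi translation checked against the returned candidate list; nearest-neighbor or Inverted Softmax retrieval would replace csls_scores with plain cosine or softmax-normalized scores.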