Word Translation using Cross-Lingual Word Embedding: Case of Sanskrit to Hindi Translation
Rashi Kumar, V. Sahula
2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP), pp. 1-7. Published 2022-02-12.
DOI: 10.1109/AISP53593.2022.9760564
Abstract
Sanskrit is a low-resource language for which large parallel data sets are not available, yet such data sets are required for Machine Translation. Cross-lingual word embeddings help to learn the meaning of words across languages in a shared vector space. In the present work, we propose a technique for translating words between Sanskrit and Hindi without a parallel corpus. Pre-trained fastText word embeddings for Sanskrit and Hindi are aligned in the same vector space using Singular Value Decomposition and a quasi-bilingual dictionary, which is generated from words with similar character strings in the monolingual embeddings of the two languages. Translations for the test dictionary are evaluated with several retrieval methods, namely Nearest Neighbor, Inverted Softmax, and Cross-domain Similarity Local Scaling, in order to address the hubness that arises in the high-dimensional space of the vector embeddings. The results are compared with other unsupervised approaches at 1, 10, and 20 neighbors. When computing cosine similarity, we observed that the similarity between the expected and the translated target words is close or equal to unity, even for cases that were not included in the quasi-bilingual dictionary used to generate the orthogonal mapping. A test dictionary was developed from the Wikipedia Sanskrit-Hindi Shabdkosh to test the translation accuracy of the system. The proposed method is being extended to sentence translation.
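The alignment and retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the SVD step is the standard orthogonal Procrustes solution, approximates "similar character strings" by exact string identity, and uses synthetic vectors in place of the fastText embeddings. All function and variable names are hypothetical, and only Nearest Neighbor and CSLS retrieval are shown.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def quasi_dictionary(src_words, tgt_words):
    """Index pairs (i, j) of words spelled identically in both vocabularies.

    Assumption: exact identity stands in for 'similar character strings'.
    """
    tgt_index = {w: j for j, w in enumerate(tgt_words)}
    return [(i, tgt_index[w]) for i, w in enumerate(src_words) if w in tgt_index]


def procrustes(x, y):
    """Orthogonal W minimizing ||x @ W - y||_F, from the SVD of x.T @ y."""
    u, _, vt = np.linalg.svd(x.T @ y)
    return u @ vt


def csls_scores(mapped_src, tgt, k=10):
    """CSLS: 2*cos(s, t) minus each side's mean similarity to its k nearest neighbours."""
    sims = mapped_src @ tgt.T
    k = min(k, sims.shape[0], sims.shape[1])
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)  # per source word
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)  # per target word
    return 2 * sims - r_src - r_tgt


# Synthetic stand-in for the two embedding tables (not real fastText vectors).
rng = np.random.default_rng(0)
dim, n_words, n_shared = 16, 60, 40
tgt_vecs = normalize_rows(rng.standard_normal((n_words, dim)))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
src_vecs = tgt_vecs @ rotation.T  # source space is a rotated copy of the target space

# Only the first n_shared words are spelled the same in both vocabularies.
src_words = [f"w{i}" for i in range(n_words)]
tgt_words = src_words[:n_shared] + [f"t{i}" for i in range(n_shared, n_words)]

# Learn the orthogonal mapping from the quasi-dictionary pairs only.
pairs = quasi_dictionary(src_words, tgt_words)
src_idx, tgt_idx = map(list, zip(*pairs))
w = procrustes(src_vecs[src_idx], tgt_vecs[tgt_idx])
mapped = src_vecs @ w

nn_pred = (mapped @ tgt_vecs.T).argmax(axis=1)            # plain nearest neighbour
csls_pred = csls_scores(mapped, tgt_vecs).argmax(axis=1)  # hubness-corrected retrieval
held_out = np.arange(n_shared, n_words)                   # words absent from the dictionary
print("NN   P@1 on held-out words:", (nn_pred[held_out] == held_out).mean())
print("CSLS P@1 on held-out words:", (csls_pred[held_out] == held_out).mean())
```

Because the synthetic source space is an exact rotation of the target space, the mapping learned from the shared words also carries over to the held-out words. This mirrors, in an idealized setting, the abstract's observation about word pairs outside the quasi-bilingual dictionary. Real embeddings are only approximately isometric, so accuracy there would be lower.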