{"title":"Adaptive GloVe and FastText Model for Hindi Word Embeddings","authors":"Vijay Gaikwad, Y. Haribhakta","doi":"10.1145/3371158.3371179","DOIUrl":null,"url":null,"abstract":"Today, a lot of research is carried out on word embeddings in NLP domain. The algorithms like GloVe, FastText are used to develop word embeddings. However, not enough work is done on Indian languages due to lack of resource availability. The datasets required for testing word embeddings are not available for Indian languages. In this paper, two algorithms are proposed - Adaptive GloVe model (AGM) and Adaptive FastText model (AFM). Adapting to the co-occurrence matrix generation process of the original GloVe model, AGM, leverages part of speech tags, morphological knowledge of the language. Assigning higher co-occurrence weight to words with same root, AGM, significantly improved accuracy of resultant word embeddings on syntactic datasets. Whereas, AFM improves the vocabulary building process of the original FastText model. The work involves generation of word embeddings for low resource language like Hindi using AGM and AFM and creation of necessary test datasets for evaluating word embeddings. AGM word embeddings showed morphological awareness, achieving 9% increase in accuracy on syntactic word analogy task, compared to original GloVe model. 
AFM outperformed FastText by 1% accuracy in word analogy task and 2 Spearman rank on word similarity task, providing state-of-the-art performance.","PeriodicalId":360747,"journal":{"name":"Proceedings of the 7th ACM IKDD CoDS and 25th COMAD","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th ACM IKDD CoDS and 25th COMAD","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3371158.3371179","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7
Abstract
Word embeddings are an active area of NLP research, and algorithms such as GloVe and FastText are widely used to train them. However, little work has been done on Indian languages because resources are scarce; in particular, the datasets needed to evaluate word embeddings are not available for these languages. This paper proposes two algorithms: the Adaptive GloVe model (AGM) and the Adaptive FastText model (AFM). AGM adapts the co-occurrence matrix generation process of the original GloVe model to leverage part-of-speech tags and morphological knowledge of the language. By assigning a higher co-occurrence weight to words that share the same root, AGM significantly improves the accuracy of the resulting word embeddings on syntactic datasets. AFM, in turn, improves the vocabulary-building process of the original FastText model. The work covers generating word embeddings for a low-resource language, Hindi, using AGM and AFM, and creating the test datasets needed to evaluate them. AGM word embeddings show morphological awareness, achieving a 9% increase in accuracy on the syntactic word analogy task compared to the original GloVe model. AFM outperforms FastText by 1% accuracy on the word analogy task and by 2 points of Spearman rank correlation on the word similarity task, providing state-of-the-art performance.
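The idea of giving same-root word pairs a higher co-occurrence weight can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_cooccurrence`, the `root_of` lookup table, the window size, and the boost factor of 2.0 are all assumptions chosen for illustration, and the abstract does not state how AGM actually computes the extra weight or how part-of-speech tags enter the process. The only borrowed detail is the standard GloVe convention of weighting a pair that is d positions apart by 1/d.

```python
from collections import defaultdict


def build_cooccurrence(sentences, root_of, window=2, same_root_boost=2.0):
    """Count symmetric, distance-weighted co-occurrences.

    sentences: list of token lists.
    root_of: hypothetical dict mapping a token to its root; tokens that
        are missing from it are treated as their own root.
    same_root_boost: assumed multiplier for pairs sharing a root.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Visit context words to the right; add both directions below.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                context = tokens[j]
                weight = 1.0 / (j - i)  # standard GloVe 1/d weighting
                if root_of.get(word, word) == root_of.get(context, context):
                    weight *= same_root_boost  # assumed same-root bonus
                counts[(word, context)] += weight
                counts[(context, word)] += weight
    return dict(counts)


# Toy example with romanized Hindi tokens: "ladka" (boy) and "ladke" (boys)
# share a root, so their pair receives the boosted weight.
toy_sentences = [["ladka", "ladke", "khelte", "hain"]]
toy_roots = {"ladka": "ladka", "ladke": "ladka"}
toy_counts = build_cooccurrence(toy_sentences, toy_roots)
```

In this toy run the adjacent same-root pair ends up with twice the weight of an adjacent unrelated pair. A real pipeline would obtain roots from a morphological analyzer and feed the resulting counts into GloVe training in place of the plain co-occurrence matrix.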