Word Embedding in Small Corpora: A Case Study in Quran
Zeinab Aghahadi, A. Talebpour
2018 8th International Conference on Computer and Knowledge Engineering (ICCKE), October 2018
DOI: 10.1109/ICCKE.2018.8566605
Citations: 1
Abstract
Text is a complex arrangement of words that carries meaning, and representing words is the first step in linguistic processing and text comprehension. Much research has used neural networks to learn semantic representations of words in various areas of natural language processing, typically training on large, general-domain text corpora. Meanwhile, some efforts have applied deep learning methods to represent the words of small corpora, supporting the hypothesis that a bigger corpus does not necessarily yield better word representations. In this research, the capability of word2vec to learn semantic representations of words in a small corpus is investigated. We consider the Skip-gram and CBOW learning models with different hyperparameter values. Two new datasets have been created to evaluate the models' performance on a small, domain-specific Quranic corpus: the first tests word categorization and the second tests pairwise word similarity. Our results demonstrate that Skip-gram performs best with 30 training iterations when the embedding dimension is set to 7.
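The core distinction the abstract draws on is how Skip-gram frames training: it predicts each context word from the center word within a sliding window (CBOW instead predicts the center word from its context). As a minimal sketch of that idea, the snippet below generates Skip-gram (center, context) training pairs from a toy English sentence; the paper's actual Quranic corpus, preprocessing, and training code are not shown here, so the corpus and window size are illustrative assumptions.

```python
# Hypothetical illustration of Skip-gram training-pair generation.
# Skip-gram predicts each context word from the center word, so every
# (center, context) pair within the window becomes one training example.

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within the given window size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "in the name of god".split()  # toy stand-in corpus
print(skipgram_pairs(sentence, window=1))
# → [('in', 'the'), ('the', 'in'), ('the', 'name'), ('name', 'the'),
#    ('name', 'of'), ('of', 'name'), ('of', 'god'), ('god', 'of')]
```

In practice, a configuration comparable to the paper's best result could be trained with the gensim library, e.g. `Word2Vec(sentences, sg=1, vector_size=7, epochs=30)` in gensim 4.x, though the authors' exact toolchain is not stated in the abstract.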