Nafiz Sadman, Akib Sadmanee, Md. Iftekhar Tanveer, Md. Ashraful Amin, A. Ali
{"title":"孟加拉语词嵌入的内在评价","authors":"Nafiz Sadman, Akib Sadmanee, Md. Iftekhar Tanveer, Md. Ashraful Amin, A. Ali","doi":"10.1109/ICBSLP47725.2019.201506","DOIUrl":null,"url":null,"abstract":"Word embeddings are vector representations of word that allow machines to learn semantic and syntactic meanings by performing computations on them. Two wellknown embedding models are CBOW and Skipgram. Different methods proposed to evaluate the quality of embeddings are categorized into extrinsic and intrinsic evaluation methods. This paper focuses on intrinsic evaluation - the evaluation of the models on tasks, such as analogy prediction, semantic relatedness, synonym detection, antonym detection and concept categorization. We present intrinsic evaluations on Bangla word embedding created using CBOW and Skipgram models on a Bangla corpus that we built. These are trained on more than 700,000 articles consisting of more than 1.3 million unique words with different embedding dimension sizes, e.g., 300, 100, 64, and 32. We created the evaluation datasets for the abovementioned tasks and performed a comprehensive evaluation. We observe, word vectors of dimension 300, produced using Skipgram models, achieves accuracy of 51.33% for analogy prediction, a correlation of 0.62 for semantic relatedness, and accuracy of 53.85% and 9.56% for synonym and antonym detection 9.56%. Finally, for concept categorization the accuracy is 91.02%. The corpus and evaluation datasets are made publicly available for further research.","PeriodicalId":413077,"journal":{"name":"2019 International Conference on Bangla Speech and Language Processing (ICBSLP)","volume":"19 10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Intrinsic Evaluation of Bangla Word Embeddings\",\"authors\":\"Nafiz Sadman, Akib Sadmanee, Md. Iftekhar Tanveer, Md. 
Ashraful Amin, A. Ali\",\"doi\":\"10.1109/ICBSLP47725.2019.201506\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Word embeddings are vector representations of word that allow machines to learn semantic and syntactic meanings by performing computations on them. Two wellknown embedding models are CBOW and Skipgram. Different methods proposed to evaluate the quality of embeddings are categorized into extrinsic and intrinsic evaluation methods. This paper focuses on intrinsic evaluation - the evaluation of the models on tasks, such as analogy prediction, semantic relatedness, synonym detection, antonym detection and concept categorization. We present intrinsic evaluations on Bangla word embedding created using CBOW and Skipgram models on a Bangla corpus that we built. These are trained on more than 700,000 articles consisting of more than 1.3 million unique words with different embedding dimension sizes, e.g., 300, 100, 64, and 32. We created the evaluation datasets for the abovementioned tasks and performed a comprehensive evaluation. We observe, word vectors of dimension 300, produced using Skipgram models, achieves accuracy of 51.33% for analogy prediction, a correlation of 0.62 for semantic relatedness, and accuracy of 53.85% and 9.56% for synonym and antonym detection 9.56%. Finally, for concept categorization the accuracy is 91.02%. 
The corpus and evaluation datasets are made publicly available for further research.\",\"PeriodicalId\":413077,\"journal\":{\"name\":\"2019 International Conference on Bangla Speech and Language Processing (ICBSLP)\",\"volume\":\"19 10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Bangla Speech and Language Processing (ICBSLP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICBSLP47725.2019.201506\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Bangla Speech and Language Processing (ICBSLP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICBSLP47725.2019.201506","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Word embeddings are vector representations of words that allow machines to learn semantic and syntactic meaning by performing computations on them. Two well-known embedding models are CBOW and Skipgram. Methods proposed to evaluate the quality of embeddings are categorized into extrinsic and intrinsic evaluation methods. This paper focuses on intrinsic evaluation: evaluating the models on tasks such as analogy prediction, semantic relatedness, synonym detection, antonym detection, and concept categorization. We present intrinsic evaluations of Bangla word embeddings created using CBOW and Skipgram models on a Bangla corpus that we built. The models are trained on more than 700,000 articles containing more than 1.3 million unique words, with embedding dimensions of 300, 100, 64, and 32. We created evaluation datasets for the above tasks and performed a comprehensive evaluation. We observe that 300-dimensional word vectors produced by the Skipgram model achieve an accuracy of 51.33% for analogy prediction, a correlation of 0.62 for semantic relatedness, and accuracies of 53.85% and 9.56% for synonym and antonym detection, respectively. Finally, for concept categorization the accuracy is 91.02%. The corpus and evaluation datasets are made publicly available for further research.
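The analogy-prediction task described above is conventionally solved by vector arithmetic: "a is to b as c is to ?" is answered by the vocabulary word whose embedding is closest, by cosine similarity, to b − a + c. A minimal sketch of that procedure is below; the 4-dimensional toy vectors and the English words are hypothetical stand-ins, not the paper's 300-dimensional Bangla embeddings.

```python
import numpy as np

# Hypothetical toy embeddings for illustration only; the paper's vectors
# are 300-dimensional and trained on a Bangla corpus.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.9, 0.0, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' by the nearest cosine neighbour
    of b - a + c, excluding the three query words themselves."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(predict_analogy("man", "king", "woman", embeddings))  # → queen
```

A prediction counts as correct when this nearest neighbour matches the gold answer; accuracy over a set of such quadruples yields figures like the 51.33% reported for the 300-dimensional Skipgram vectors.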