Enhancement of the word2vec Class-Based Language Modeling by Optimizing the Features Vector Using PCA

Tiba Zaki Abdulhameed, I. Zitouni, I. Abdel-Qader
{"title":"Enhancement of the word2vec Class-Based Language Modeling by Optimizing the Features Vector Using PCA","authors":"Tiba Zaki Abdulhameed, I. Zitouni, I. Abdel-Qader","doi":"10.1109/EIT.2018.8500303","DOIUrl":null,"url":null,"abstract":"Neural word embedding, such as word2vec, produces very large features' vectors. In this paper, we are investigating the length of the feature vector aiming to optimize the word representation results, and also to speed up the algorithm by addressing noise impact. Principal Component Analysis (PCA) has a proven record in dimensionality reduction as we selected it to achieve our objectives. We also selected class based Language Modeling as extrinsic evaluation of the features vectors and are using Perplexity (pp) as our metric. K-means clustering is used as words classification. The execution time of the classification is also computed. As a result, we concluded that for a given test data, if the training data is of same domain then large vector size can increase the precision of describing word relations. In contrast, if the training data is from different domain and contains large amount of contexts not expected to occur in the test data then a small vector size will give a better description to help reducing the noise effect on clustering decisions. Two different data training domains were used in this analysis; Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi phone conversations with testing data of the same Iraqi data domain. Depending on this analysis, same domain training data and test data have execution times reduced by 61% while keeping same representation efficiency. In addition, for different domain training data i.e. MSA, pp reduction ratio of 6.7% is achieved with time reduced by 92%. This implies the importance of carefully choosing feature vector size on the overall performance.","PeriodicalId":188414,"journal":{"name":"2018 IEEE International Conference on Electro/Information Technology (EIT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Conference on Electro/Information Technology (EIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EIT.2018.8500303","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Neural word embeddings such as word2vec produce very long feature vectors. In this paper, we investigate the length of the feature vector, aiming to optimize the word representation and to speed up the algorithm by reducing the impact of noise. We selected Principal Component Analysis (PCA), which has a proven record in dimensionality reduction, to achieve these objectives. We use class-based language modeling as an extrinsic evaluation of the feature vectors, with perplexity (pp) as the metric, and K-means clustering to assign words to classes; the execution time of this classification step is also measured. We conclude that, for a given test set, if the training data comes from the same domain, a large vector size increases the precision with which word relations are described. In contrast, if the training data comes from a different domain and contains many contexts that are not expected to occur in the test data, a small vector size gives a better description and helps reduce the effect of noise on clustering decisions. Two training domains were used in this analysis: Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi Arabic phone conversations, with test data drawn from the same Iraqi domain. Based on this analysis, with same-domain training and test data, execution time is reduced by 61% while representation quality is preserved; for training data from a different domain, i.e. MSA, a perplexity reduction of 6.7% is achieved with execution time reduced by 92%. This underlines the importance of carefully choosing the feature vector size for overall performance.
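A minimal sketch of the pipeline the abstract describes, written in Python with gensim's word2vec and scikit-learn's PCA and K-means. The toy corpus, vector sizes, and number of clusters below are illustrative assumptions, not the paper's actual Arabic data or settings, and the sketch assumes gensim >= 4 (where the word2vec dimension is passed as vector_size).

# Sketch: train word2vec, shrink its feature vectors with PCA,
# then cluster words with K-means to form classes for a class-based LM.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy English corpus standing in for the MSA / Iraqi Arabic training text.
corpus = [
    ["the", "news", "report", "was", "broadcast"],
    ["the", "phone", "conversation", "was", "recorded"],
    ["news", "and", "reports", "were", "broadcast", "daily"],
]

# Original (large) word2vec feature vectors.
model = Word2Vec(sentences=corpus, vector_size=100, window=2,
                 min_count=1, sg=1, seed=0, workers=1)
vocab = model.wv.index_to_key      # list of words
vectors = model.wv.vectors         # shape: (len(vocab), 100)

# Optimize the feature-vector length with PCA (10 dimensions is an
# illustrative choice, not the value tuned in the paper).
reduced = PCA(n_components=10, random_state=0).fit_transform(vectors)

# K-means assigns each word to a class; these classes would feed the
# class-based language model evaluated by perplexity.
classes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
word_to_class = dict(zip(vocab, classes))
print(word_to_class)

In the paper's extrinsic evaluation, the resulting word classes feed a class-based language model whose perplexity is compared across feature-vector sizes; the sketch stops at the clustering step.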