Enhancement of the word2vec Class-Based Language Modeling by Optimizing the Features Vector Using PCA

Tiba Zaki Abdulhameed, I. Zitouni, I. Abdel-Qader
{"title":"Enhancement of the word2vec Class-Based Language Modeling by Optimizing the Features Vector Using PCA","authors":"Tiba Zaki Abdulhameed, I. Zitouni, I. Abdel-Qader","doi":"10.1109/EIT.2018.8500303","DOIUrl":null,"url":null,"abstract":"Neural word embedding, such as word2vec, produces very large features' vectors. In this paper, we are investigating the length of the feature vector aiming to optimize the word representation results, and also to speed up the algorithm by addressing noise impact. Principal Component Analysis (PCA) has a proven record in dimensionality reduction as we selected it to achieve our objectives. We also selected class based Language Modeling as extrinsic evaluation of the features vectors and are using Perplexity (pp) as our metric. K-means clustering is used as words classification. The execution time of the classification is also computed. As a result, we concluded that for a given test data, if the training data is of same domain then large vector size can increase the precision of describing word relations. In contrast, if the training data is from different domain and contains large amount of contexts not expected to occur in the test data then a small vector size will give a better description to help reducing the noise effect on clustering decisions. Two different data training domains were used in this analysis; Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi phone conversations with testing data of the same Iraqi data domain. Depending on this analysis, same domain training data and test data have execution times reduced by 61% while keeping same representation efficiency. In addition, for different domain training data i.e. MSA, pp reduction ratio of 6.7% is achieved with time reduced by 92%. This implies the importance of carefully choosing feature vector size on the overall performance.","PeriodicalId":188414,"journal":{"name":"2018 IEEE International Conference on Electro/Information Technology (EIT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Conference on Electro/Information Technology (EIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EIT.2018.8500303","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Neural word embeddings such as word2vec produce very long feature vectors. In this paper, we investigate the length of the feature vector, aiming to optimize the word representation and to speed up the algorithm by reducing the impact of noise. We selected Principal Component Analysis (PCA), which has a proven record in dimensionality reduction, to achieve these objectives. We use class-based language modeling as an extrinsic evaluation of the feature vectors, with perplexity (pp) as the metric, and K-means clustering to assign words to classes; the execution time of this classification step is also measured. We conclude that, for a given test set, if the training data comes from the same domain, a large vector size increases the precision with which word relations are described. In contrast, if the training data comes from a different domain and contains many contexts that are not expected to occur in the test data, a small vector size gives a better description and helps reduce the effect of noise on clustering decisions. Two training domains were used in this analysis: Modern Standard Arabic (MSA) broadcast news and reports, and Iraqi Arabic phone conversations, with test data drawn from the same Iraqi domain. Based on this analysis, with same-domain training and test data, execution time is reduced by 61% while representation quality is preserved; for training data from a different domain, i.e. MSA, a perplexity reduction of 6.7% is achieved with execution time reduced by 92%. This underlines the importance of carefully choosing the feature vector size for overall performance.
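A minimal sketch of the pipeline the abstract describes, written in Python with gensim's word2vec and scikit-learn's PCA and K-means. The toy corpus, vector sizes, and number of clusters below are illustrative assumptions, not the paper's actual Arabic data or settings, and the sketch assumes gensim >= 4 (where the word2vec dimension is passed as vector_size).

# Sketch: train word2vec, shrink its feature vectors with PCA,
# then cluster words with K-means to form classes for a class-based LM.
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Toy English corpus standing in for the MSA / Iraqi Arabic training text.
corpus = [
    ["the", "news", "report", "was", "broadcast"],
    ["the", "phone", "conversation", "was", "recorded"],
    ["news", "and", "reports", "were", "broadcast", "daily"],
]

# Original (large) word2vec feature vectors.
model = Word2Vec(sentences=corpus, vector_size=100, window=2,
                 min_count=1, sg=1, seed=0, workers=1)
vocab = model.wv.index_to_key      # list of words
vectors = model.wv.vectors         # shape: (len(vocab), 100)

# Optimize the feature-vector length with PCA (10 dimensions is an
# illustrative choice, not the value tuned in the paper).
reduced = PCA(n_components=10, random_state=0).fit_transform(vectors)

# K-means assigns each word to a class; these classes would feed the
# class-based language model evaluated by perplexity.
classes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)
word_to_class = dict(zip(vocab, classes))
print(word_to_class)

In the paper's extrinsic evaluation, the resulting word classes feed a class-based language model whose perplexity is compared across feature-vector sizes; the sketch stops at the clustering step.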