Improving document representation using KPCA and clustered word embeddings

Aakansha Gupta, R. Katarya
Published in: 2021 5th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT)
Publication date: 2021-12-10
DOI: https://doi.org/10.1109/ICEECCOT52851.2021.9707915

Abstract

Text mining approaches have been shown to be an efficient way to examine publicly available online information on a variety of topics. However, online texts are difficult to represent because they lack context. Word embedding techniques such as word2vec capture semantic relationships between words particularly well when trained on large text collections, but they fail when trained on small datasets. This paper proposes a document representation method based on word clusters. In this approach, word embeddings created by word2vec are supplemented with morphological information obtained from kernel principal component analysis (KPCA) computed on word similarity. After KPCA, a spherical k-means algorithm is applied, with each centroid representing the topic of its cluster. Next, the document is vectorized using the frequencies of these clusters. Through these data-driven notions, the proposed approach captures the effects of semantically and morphologically related terms on document proximity. When combined with an appropriate weighting system, the proposed approach improves document representation and the interpretability of the resulting document vectors.
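The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the hand-written 2-D vectors stand in for word2vec embeddings, a Jaccard similarity over character bigrams stands in for the paper's word-similarity measure, and the function names (`jaccard_kernel`, `kpca_from_kernel`, `spherical_kmeans`, `document_vector`), the vocabulary, and the choice of two clusters are all hypothetical.

```python
import numpy as np

# ASSUMPTION: toy 2-D vectors standing in for word2vec embeddings.
embeddings = {
    "run": [0.90, 0.10], "running": [0.80, 0.20], "runner": [0.85, 0.15],
    "bank": [0.10, 0.90], "banking": [0.20, 0.80], "banker": [0.15, 0.85],
}
vocab = list(embeddings)
E = np.array([embeddings[w] for w in vocab])

def bigrams(word):
    """Set of character bigrams of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard_kernel(words):
    """ASSUMPTION: bigram Jaccard similarity as the word-similarity kernel."""
    grams = [bigrams(w) for w in words]
    n = len(words)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = len(grams[i] & grams[j]) / len(grams[i] | grams[j])
    return K

def kpca_from_kernel(K, n_components):
    """Kernel PCA on a precomputed kernel: centre, eigendecompose, project."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centring in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]
    # Projection of each training point: sqrt(lambda_k) * v_k
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """k-means on the unit sphere: cosine assignment, renormalised centroids."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            total = X[labels == j].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:                         # keep old centroid if cluster is empty
                centroids[j] = total / norm
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

def document_vector(tokens, word_to_cluster, k):
    """Count how often each word cluster occurs in the document."""
    vec = np.zeros(k)
    for token in tokens:
        if token in word_to_cluster:
            vec[word_to_cluster[token]] += 1
    return vec

k = 2
M = kpca_from_kernel(jaccard_kernel(vocab), n_components=2)  # morphological part
augmented = np.hstack([E, M])                                # semantic + morphological
labels, centroids = spherical_kmeans(augmented, k)
word_to_cluster = dict(zip(vocab, labels))
doc_vec = document_vector("the runner was running to the bank".split(),
                          word_to_cluster, k)
print(word_to_cluster)
print(doc_vec)
```

The resulting vector has one entry per cluster, so each dimension can be read as the frequency of a group of related words. The weighting system mentioned in the abstract is not specified there, so this sketch uses raw counts.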