Improving document representation using KPCA and clustered word embeddings

2021 5th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT). Pub Date: 2021-12-10. DOI: 10.1109/ICEECCOT52851.2021.9707915

Aakansha Gupta, R. Katarya



Abstract

Text mining approaches have proven to be an efficient way to examine publicly available online information on a variety of topics. However, because online texts lack context, they are difficult to represent. Word embedding techniques such as word2vec capture semantic relationships between words particularly well when trained on large text collections, but they fail when trained on small datasets. This paper proposes a document representation method based on word clusters. In this approach, word embeddings created by word2vec are supplemented with morphological information obtained from kernel principal component analysis (KPCA) computed on word similarity. After KPCA, a spherical k-means algorithm is applied, with each centroid representing the topic of its cluster. Each document is then vectorized using the frequencies of these clusters. Through these data-driven notions, the proposed approach captures the effects of semantically and morphologically related terms on document proximity. When combined with an appropriate weighting scheme, it improves both the document representation and the interpretability of the resulting document vectors.
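The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the word-similarity measure, the way KPCA features are combined with the embeddings, or the weighting scheme. The sketch assumes a character-bigram Jaccard similarity as the kernel, simple concatenation of the two feature sets, relative cluster frequencies as the document vector, and random vectors as a stand-in for trained word2vec embeddings. All function names are hypothetical.

```python
import numpy as np

def char_ngrams(word, n=2):
    # Character n-grams with boundary markers (assumed morphological proxy).
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity_matrix(words, n=2):
    # Jaccard similarity between n-gram sets; the measure is an assumption.
    grams = [char_ngrams(w, n) for w in words]
    size = len(words)
    K = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            K[i, j] = len(grams[i] & grams[j]) / len(grams[i] | grams[j])
    return K

def kpca(K, n_components):
    # Kernel PCA on a precomputed kernel: center it, then eigendecompose.
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

def normalize_rows(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def spherical_kmeans(X, k, n_iter=50, seed=0):
    # k-means on the unit sphere: assign by cosine similarity,
    # update each centroid as the normalized sum of its members.
    rng = np.random.default_rng(seed)
    X = normalize_rows(X)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.sum(axis=0)
        centroids = normalize_rows(centroids)
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

def document_vector(tokens, word_to_cluster, k):
    # Relative frequency of each word cluster in the document.
    counts = np.zeros(k)
    for token in tokens:
        if token in word_to_cluster:
            counts[word_to_cluster[token]] += 1
    total = counts.sum()
    return counts / total if total else counts

vocab = ["run", "running", "runner", "walk", "walking", "walker"]
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(len(vocab), 8))  # stand-in for word2vec vectors
morph = kpca(similarity_matrix(vocab), n_components=3)
features = np.hstack([normalize_rows(embeddings), normalize_rows(morph)])
labels, centroids = spherical_kmeans(features, k=2)
word_to_cluster = dict(zip(vocab, labels))
doc = document_vector("the runner kept running".split(), word_to_cluster, k=2)
```

Each entry of `doc` is the share of in-vocabulary tokens that fall into one word cluster, which is what makes the vector readable: a dimension corresponds to a cluster whose member words can be listed. A weighting scheme such as TF-IDF over cluster counts could replace the plain relative frequency used here.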


