Improving document representation using KPCA and clustered word embeddings
Authors: Aakansha Gupta, R. Katarya
Published in: 2021 5th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT)
Publication date: 2021-12-10
DOI: 10.1109/ICEECCOT52851.2021.9707915 (https://doi.org/10.1109/ICEECCOT52851.2021.9707915)
Citation count: 0
Abstract
Text mining approaches have been shown to be an efficient way to examine publicly available online information on a variety of topics. However, because they lack context, online texts are difficult to represent. Word embedding techniques such as word2vec capture semantic relationships between words particularly well when trained on large text collections, but they fail when trained on small datasets. This paper proposes a document representation method based on word clusters. In this approach, word embeddings created by word2vec are supplemented with morphological information obtained from kernel principal component analysis (KPCA) computed on word similarity. After running KPCA, a spherical k-means algorithm is applied, with each centroid representing the topic of its cluster. Next, the document is vectorized using the frequencies of these clusters. Through these data-driven notions, the proposed approach incorporates the effects of semantically and morphologically related terms on document proximity. When combined with an appropriate weighting system, the proposed approach improves document representation and the interpretability of the resulting document vectors.
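The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the embeddings are random stand-ins for word2vec vectors, the "word similarity" kernel is assumed to be Jaccard similarity over character bigrams, the two feature sets are simply concatenated, and the document vector uses raw cluster counts because the abstract does not specify the weighting scheme. All function names are hypothetical.

```python
import numpy as np

def char_bigrams(word):
    # Character bigrams with boundary markers (assumed similarity basis).
    padded = f"#{word}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard_kernel(words):
    # Pairwise Jaccard similarity between the bigram sets of all words.
    grams = [char_bigrams(w) for w in words]
    n = len(words)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = len(grams[i] & grams[j]) / len(grams[i] | grams[j])
    return K

def kernel_pca(K, n_components):
    # Double-center the kernel matrix, then project onto its top eigenvectors.
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)  # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[idx], 0.0, None)
    return vecs[:, idx] * np.sqrt(vals)

def l2_normalize(X):
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def spherical_kmeans(X, k, n_iter=20, seed=0):
    # k-means on the unit sphere: assign by cosine similarity,
    # update each centroid as the normalized sum of its members.
    rng = np.random.default_rng(seed)
    X = l2_normalize(X)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.sum(axis=0)
        centroids = l2_normalize(centroids)
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids

def document_vector(tokens, word_to_cluster, k):
    # Count how many tokens of the document fall into each word cluster.
    vec = np.zeros(k)
    for token in tokens:
        if token in word_to_cluster:
            vec[word_to_cluster[token]] += 1
    return vec

vocab = ["run", "running", "runner", "walk", "walking", "walker"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # stand-in for word2vec vectors

kernel = jaccard_kernel(vocab)
morph = kernel_pca(kernel, n_components=2)
features = np.hstack([l2_normalize(embeddings), l2_normalize(morph)])

n_clusters = 2
labels, centroids = spherical_kmeans(features, k=n_clusters)
word_to_cluster = dict(zip(vocab, labels))

doc = "the runner was running then walking".split()
doc_vec = document_vector(doc, word_to_cluster, k=n_clusters)
print(doc_vec)
```

Each entry of `doc_vec` counts the document's tokens that belong to one word cluster, so the vector has one interpretable dimension per cluster. A weighting step (for example, a tf-idf style reweighting of the cluster counts) could be applied on top, but that choice is an assumption here.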