Explainable Graph Spectral Clustering of text documents.

IF 2.6 · Zone 3 (multidisciplinary journals) · JCR Q1 (Multidisciplinary Sciences)

PLoS ONE, 20(2): e0313238. Pub Date: 2025-02-04. eCollection Date: 2025-01-01. DOI: 10.1371/journal.pone.0313238

Bartłomiej Starosta, Mieczysław A Kłopotek, Sławomir T Wierzchoń, Dariusz Czerski, Marcin Sydow, Piotr Borkowski



Abstract

Spectral clustering methods are known for their ability to represent clusters of diverse shapes and densities. However, the results of such algorithms, when applied for example to text documents, are hard to explain to the user, mainly because the documents are embedded in a spectral space that has no obvious relation to their contents. There is therefore an urgent need for methods that explain the outcome of the clustering. In this paper we construct a theoretical bridge linking the clusters resulting from Graph Spectral Clustering to the actual document content, given that similarities between documents are computed as cosine measures in tf or tfidf representation. This link makes it possible to explain cluster membership in clusters produced by GSA. We propose an explanation of the results of graph spectral clustering based on the combinatorial and the normalized Laplacian. For this purpose, we show the (approximate) equivalence of combinatorial Laplacian embedding, K-embedding (proposed in this paper), and term vector space embedding. An experimental study shows that K-embedding approximates Laplacian embedding well under favourable block matrix conditions, and that the approximation is good enough under other conditions. We also show the perfect equivalence of normalized Laplacian embedding, the [Formula: see text]-embedding (proposed in this paper), and (weighted) term vector space embedding. Hence a bridge is constructed between the textual contents and the clustering results for both combinatorial and normalized Laplacian based Graph Spectral Clustering. We provide a theoretical background for our approach. An initial version of this paper is available on arXiv (Starosta B 2023). The reader may refer to that text for the formal aspects of our method and a detailed overview of the motivation.
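As a rough illustration of the setting the abstract describes, the sketch below builds cosine similarities between tf vectors, forms the combinatorial Laplacian L = D - S and the normalized Laplacian D^(-1/2) L D^(-1/2), and splits the documents by the sign of the second eigenvector. It is a generic textbook spectral bisection, not the authors' K-embedding or their explanation method. The toy corpus `DOCS` and all function names are assumptions made for this example, and it assumes every document is non-empty and similar to at least one other document.

```python
import numpy as np

# Hypothetical toy corpus: two topics joined by one shared word ("pet").
DOCS = ["cat kitten", "cat kitten pet", "stock market pet", "stock market"]

def tf_matrix(docs):
    """Raw term-frequency matrix (documents x vocabulary), whitespace tokens."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    col = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, col[w]] += 1.0
    return X, vocab

def cosine_similarity(X):
    """Cosine similarity between rows of X, with the diagonal set to zero."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes no empty rows
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)  # no self-loops in the similarity graph
    return S

def laplacian(S, normalized=False):
    """Combinatorial Laplacian D - S, or its symmetric normalized variant."""
    deg = S.sum(axis=1)
    L = np.diag(deg) - S
    if normalized:
        inv_sqrt = 1.0 / np.sqrt(deg)  # assumes no isolated documents
        L = inv_sqrt[:, None] * L * inv_sqrt[None, :]
    return L

def two_way_labels(S, normalized=False):
    """Split documents by the sign of the second-smallest eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(laplacian(S, normalized))  # ascending
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int), eigvals

X, vocab = tf_matrix(DOCS)
S = cosine_similarity(X)
for normalized in (False, True):
    labels, eigvals = two_way_labels(S, normalized)
    name = "normalized" if normalized else "combinatorial"
    print(name, labels, np.round(eigvals, 3))
```

The eigenvector sign is arbitrary, so only the grouping matters, not which group is labelled 0 or 1. For more than two clusters, a common choice is to run k-means on the rows of the first k eigenvectors. The eigenvector coordinates themselves say nothing about words, which is the explainability gap the abstract addresses.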



Source journal

PLoS ONE (Biology)

CiteScore: 6.20
Self-citation rate: 5.40%
Articles published: 14242
Review time: 3.7 months

About the journal: PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open access: freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage