Document Clustering via Matrix Representation

2011 IEEE 11th International Conference on Data Mining. Pub Date: 2011-12-11. DOI: 10.1109/ICDM.2011.59

Xufei Wang, Jiliang Tang, Huan Liu


Citations: 21

Abstract

The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and computationally easy to handle, but it oversimplifies a document into a single vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. This paper proposes a matrix representation of documents in which rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics, which can be mapped to clustering structures. Latent topic extraction based on the matrix representation is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^\top, and a non-negative weight matrix M_i, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation consistently improves clustering quality, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods such as LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables more relevant documents to be returned in information retrieval tasks.
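The factorization described in the abstract, A_i ≈ L M_i R^\top with all factors non-negative and L, R shared across documents, can be illustrated with a minimal sketch. The abstract does not give the optimization procedure, so the code below swaps in generic multiplicative updates of the kind commonly used for NMF; it is not the authors' algorithm. It also assumes that every document matrix has the same shape (same vocabulary, same number of segments), and the names `factorize_documents`, `reconstruction_error`, `k_rows`, and `k_cols` are hypothetical.

```python
import numpy as np

def factorize_documents(docs, k_rows, k_cols, n_iter=100, eps=1e-9, seed=0):
    """Approximate each A_i in docs by L @ M_i @ R.T with non-negative factors.

    Assumptions (not from the paper): all A_i share one shape
    (terms x segments), and generic multiplicative updates are used.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_segments = docs[0].shape
    L = rng.random((n_terms, k_rows))        # shared left base
    R = rng.random((n_segments, k_cols))     # shared right base
    Ms = [rng.random((k_rows, k_cols)) for _ in docs]  # per-document weights

    for _ in range(n_iter):
        # Update L: ratio of the negative and positive gradient parts.
        num = sum(A @ R @ M.T for A, M in zip(docs, Ms))
        den = sum(L @ M @ (R.T @ R) @ M.T for M in Ms) + eps
        L *= num / den

        # Update R with the same rule applied to the transposed problem.
        num = sum(A.T @ L @ M for A, M in zip(docs, Ms))
        den = sum(R @ M.T @ (L.T @ L) @ M for M in Ms) + eps
        R *= num / den

        # Update each M_i with L and R held fixed.
        LtL, RtR = L.T @ L, R.T @ R
        for i, A in enumerate(docs):
            Ms[i] *= (L.T @ A @ R) / (LtL @ Ms[i] @ RtR + eps)

    return L, R, Ms

def reconstruction_error(docs, L, R, Ms):
    """Sum of squared Frobenius reconstruction errors over all documents."""
    return sum(np.linalg.norm(A - L @ M @ R.T) ** 2 for A, M in zip(docs, Ms))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy data: 4 documents, 12 terms, 5 segments each (random, illustrative only).
    docs = [rng.random((12, 5)) for _ in range(4)]
    L, R, Ms = factorize_documents(docs, k_rows=3, k_cols=2, n_iter=50)
    print("error after 50 iterations:", reconstruction_error(docs, L, R, Ms))
```

Because the updates only multiply non-negative quantities, the factors stay non-negative without an explicit projection step. In this sketch each M_i is the compact per-document representation that a downstream clustering step could consume; how the paper maps latent topics to clusters is not specified in the abstract.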
