Document clustering via Dirichlet process mixture model with feature selection

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. Pub Date: 2010-07-25. DOI: 10.1145/1835804.1835901

Guan Yu, Rui-zhang Huang, Zhaojun Wang

Citations: 61

Abstract

A central problem in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, DPMFS, to address this problem. The approach is designed 1) to group documents into clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify discriminative words and separate them from irrelevant noise words via the stochastic search variable selection technique. We evaluate the approach on a synthetic dataset and several realistic document datasets. Comparison with state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
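As a rough illustration of why a Dirichlet process mixture does not need the number of clusters fixed in advance, the following minimal sketch samples a partition from the Chinese restaurant process, a standard representation of the Dirichlet process prior. It is not the DPMFS algorithm: it ignores document content and feature selection entirely, and the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions introduced here for illustration only.

```python
import random


def sample_crp_partition(n_docs, alpha, seed=0):
    """Sample cluster assignments for n_docs items from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha > 0 is the concentration parameter.
    """
    rng = random.Random(seed)
    assignments = []    # cluster index assigned to each item, in arrival order
    cluster_sizes = []  # number of items currently in each cluster
    for i in range(n_docs):
        # Item i joins existing cluster k with probability size_k / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        _, sizes = sample_crp_partition(500, alpha, seed=1)
        print(f"alpha={alpha}: {len(sizes)} clusters")
```

Under this prior the number of occupied clusters is a random quantity that tends to grow with both the collection size and `alpha`. In a full mixture model, the likelihood of each document under each cluster would reweight these prior probabilities.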
