XML Documents Clustering Using Tensor Space Model -- A Preliminary Study

2010 IEEE International Conference on Data Mining Workshops. Pub Date: 2010-12-13. DOI: 10.1109/ICDMW.2010.106

Sangeetha Kutty, R. Nayak, Yuefeng Li

Citations: 4

Abstract

A hierarchical structure is used to represent the content of semi-structured documents such as XML and XHTML. The traditional Vector Space Model (VSM) is not sufficient to represent both the structure and the content of such web documents. Hence, in this paper, we introduce a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilize it for clustering. Empirical analysis shows that the proposed method scales to a real-life dataset, and that the factorized matrices it produces help to improve the quality of clusters because the document representation is enriched with both structure and content information.
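
To make the contrast between the two representations concrete, the following is a minimal sketch, not the method of the paper. It assumes three toy documents, uses root-to-element tag paths as the structure feature, and compares documents by cosine similarity; all of these choices, and every function name, are hypothetical. The abstract does not specify the structure features or the tensor factorization actually used, so no factorization is attempted here. Each document becomes a sparse (path, term) slice of a document x path x term tensor, and summing out the path mode gives the usual term vector.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Toy documents (hypothetical). d0 and d1 contain the same words under
# different tags; d2 is an exact copy of d0.
DOCS = [
    "<article><title>tensor model</title><body>xml clustering</body></article>",
    "<article><title>xml clustering</title><body>tensor model</body></article>",
    "<article><title>tensor model</title><body>xml clustering</body></article>",
]


def path_term_counts(xml_text):
    """Count (path, term) pairs; path is the root-to-element tag path."""
    counts = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        for term in (elem.text or "").split():
            counts[(path, term.lower())] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts


def drop_structure(doc_slice):
    """Sum out the path mode, leaving plain term counts (the VSM view)."""
    terms = Counter()
    for (_, term), n in doc_slice.items():
        terms[term] += n
    return terms


def cosine(a, b):
    """Cosine similarity between two sparse count mappings."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tensor view: one sparse (path, term) slice per document.
tensor = [path_term_counts(d) for d in DOCS]
# Vector view: the same documents with structure discarded.
vectors = [drop_structure(s) for s in tensor]

vsm_sim_01 = cosine(vectors[0], vectors[1])  # content only
tsm_sim_01 = cosine(tensor[0], tensor[1])    # structure and content
tsm_sim_02 = cosine(tensor[0], tensor[2])    # identical documents

print(vsm_sim_01, tsm_sim_01, tsm_sim_02)
```

In this toy setting the term-only vectors cannot tell d0 and d1 apart, while the structure-aware slices can. A clustering step would then operate on a low-rank factorization of the tensor rather than on the raw slices.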



