Text Clustering Algorithm Based on Lexical Graph

Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007) | Pub Date: 2007-08-24 | DOI: 10.1109/FSKD.2007.560

Yun Sha, Guoying Zhang, Huina Jiang


Citations: 2

Abstract

Text clustering methods group texts into thematic clusters, an important task in many fields, such as search engines. The well-known text clustering methods, however, do not really address two problems specific to text: the very high dimensionality of the data and the understandability of the cluster descriptions. This paper proposes a text clustering algorithm based on a lexical graph, which is a kind of term-based clustering method. The lexical graph is built with nodes representing words and edges representing their co-occurrence in text. The attribute of each node is the set of texts in which the word occurs. A cluster center is defined as a node (word) with a large degree in this graph; the texts attributed to the center and to its neighbors are assigned to one cluster, whose description is the center node. This approach drastically reduces the dimensionality of the data and improves the synonymy extension ability. An experimental evaluation on Web documents as well as classical text documents demonstrates that the proposed algorithm obtains clusterings of comparable quality significantly more efficiently than the K-Means and STC algorithms on the search results data set. Furthermore, the method provides an understandable description of each discovered cluster through its center.
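
The abstract gives only an outline of the method, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. It assumes that two words co-occur when they appear in the same document, that the "large degree" criterion can be approximated by taking the top `num_centers` words by degree, and that clusters are allowed to overlap. The names `build_lexical_graph`, `cluster_by_centers`, and `num_centers` are hypothetical, and any edge weighting or thresholding the paper may use is omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_lexical_graph(docs):
    """Build a lexical graph from tokenized documents.

    Returns (neighbors, occurs_in):
      neighbors[word] -> set of words sharing at least one document with it
      occurs_in[word] -> set of ids of the documents containing the word
    Assumption: co-occurrence is measured at the document level.
    """
    neighbors = defaultdict(set)
    occurs_in = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        words = set(tokens)
        for word in words:
            occurs_in[word].add(doc_id)
        # Connect every pair of distinct words found in the same document.
        for a, b in combinations(sorted(words), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors, occurs_in


def cluster_by_centers(docs, num_centers=2):
    """Pick the highest-degree words as centers and collect their documents.

    Assumption: "large degree" means the top `num_centers` words by degree,
    with ties broken alphabetically. Clusters may overlap.
    """
    neighbors, occurs_in = build_lexical_graph(docs)
    ranked = sorted(occurs_in, key=lambda w: (-len(neighbors[w]), w))
    clusters = {}
    for center in ranked[:num_centers]:
        # Start with the center's own documents, then add its neighbors'.
        members = set(occurs_in[center])
        for neighbor in neighbors[center]:
            members |= occurs_in[neighbor]
        clusters[center] = members
    return clusters


if __name__ == "__main__":
    toy_docs = [
        ["apple", "fruit"],
        ["apple", "pie"],
        ["car", "engine"],
        ["car", "road"],
    ]
    for center, members in cluster_by_centers(toy_docs).items():
        print(center, sorted(members))
    # prints:
    # apple [0, 1]
    # car [2, 3]
```

Each cluster is keyed by its center word, which doubles as the cluster description, and the graph has one node per distinct word instead of one vector dimension per word for every document.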


