Incremental document clustering using cluster similarity histograms

Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003). Pub Date: 2003-10-13. DOI: 10.1109/WI.2003.1241276

Khaled M. Hammouda, M. Kamel


Citations: 59

Abstract

Clustering of large collections of text documents is a key process in providing a higher level of knowledge about the underlying inherent classification of the documents. Web documents, in particular, are of great interest since managing, accessing, searching, and browsing large repositories of Web content requires efficient organization. Incremental clustering algorithms are always preferred to traditional clustering techniques, since they can be applied in a dynamic environment such as the Web. An incremental document clustering algorithm is introduced, which relies only on pair-wise document similarity information. Clusters are represented using a cluster similarity histogram, a concise statistical representation of the distribution of similarities within each cluster, which provides a measure of cohesiveness. The measure guides the incremental clustering process. Complexity analysis and experimental results are discussed and show that the algorithm requires less computational time than standard methods while achieving a comparable or better clustering quality.
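The abstract describes the cluster similarity histogram only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes similarities lie in [0, 1], takes cohesiveness to be the fraction of within-cluster pairwise similarities at or above a threshold, and admits a document to a cluster only if that fraction does not fall too far. The names `histogram`, `cohesiveness`, `add_document`, and `jaccard`, and the parameters `threshold`, `min_ratio`, and `max_drop`, are all hypothetical choices made for illustration.

```python
class Cluster:
    """A cluster keeps its documents and all pairwise similarities among them."""
    def __init__(self):
        self.docs = []
        self.sims = []  # one entry per document pair inside the cluster


def histogram(similarities, bins=10):
    """Count similarities (each assumed in [0, 1]) per equal-width bin."""
    counts = [0] * bins
    for s in similarities:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def cohesiveness(similarities, threshold=0.5):
    """Assumed measure: share of pairwise similarities at or above threshold.
    A cluster with no pairs (0 or 1 documents) is treated as fully cohesive."""
    if not similarities:
        return 1.0
    return sum(1 for s in similarities if s >= threshold) / len(similarities)


def jaccard(a, b):
    """Illustrative pair-wise similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_document(clusters, doc, sim, threshold=0.5, min_ratio=0.5, max_drop=0.1):
    """Place doc in every cluster whose cohesiveness it does not degrade much;
    open a new cluster if none accepts it. Only pair-wise similarities are used."""
    placed = False
    for cluster in clusters:
        new_sims = [sim(doc, other) for other in cluster.docs]
        old = cohesiveness(cluster.sims, threshold)
        new = cohesiveness(cluster.sims + new_sims, threshold)
        # Accept if cohesiveness holds or improves, or drops only slightly
        # while staying above the assumed floor min_ratio.
        if new >= old or (new >= min_ratio and old - new <= max_drop):
            cluster.docs.append(doc)
            cluster.sims.extend(new_sims)
            placed = True
    if not placed:
        fresh = Cluster()
        fresh.docs.append(doc)
        clusters.append(fresh)
    return clusters
```

Documents are fed one at a time, for example `for d in docs: add_document(clusters, d, jaccard)`, so the clustering can grow as new Web documents arrive. Each insertion compares the new document only against documents already held in the candidate clusters, which is what makes the procedure incremental.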

