Incremental document clustering using cluster similarity histograms

Khaled M. Hammouda, M. Kamel
DOI: 10.1109/WI.2003.1241276 (https://doi.org/10.1109/WI.2003.1241276)
Published in: Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003)
Publication date: 2003-10-13
Citation count: 59

Abstract

Clustering of large collections of text documents is a key process in providing a higher level of knowledge about the underlying inherent classification of the documents. Web documents, in particular, are of great interest since managing, accessing, searching, and browsing large repositories of Web content requires efficient organization. Incremental clustering algorithms are always preferred to traditional clustering techniques, since they can be applied in a dynamic environment such as the Web. An incremental document clustering algorithm is introduced, which relies only on pair-wise document similarity information. Clusters are represented using a cluster similarity histogram, a concise statistical representation of the distribution of similarities within each cluster, which provides a measure of cohesiveness. The measure guides the incremental clustering process. Complexity analysis and experimental results are discussed and show that the algorithm requires less computational time than standard methods while achieving a comparable or better clustering quality.
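The abstract does not give the algorithm's details, so the following is only a minimal sketch of how a similarity histogram could guide incremental assignment. Everything specific here is an assumption for illustration: the cosine measure over term-weight dictionaries, the cohesiveness measure (the fraction of pair-wise similarities at or above `threshold`), the acceptance rule with `max_drop` and `min_ratio`, and the choice to place each document in a single best cluster.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def histogram(sims, bins=10):
    """Count pair-wise similarities in [0, 1] per equal-width bin."""
    counts = [0] * bins
    for s in sims:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts

def cohesiveness(sims, threshold=0.3):
    """Assumed measure: share of pair-wise similarities at or above threshold.
    A cluster with no pairs yet (a single document) counts as fully cohesive."""
    if not sims:
        return 1.0
    return sum(s >= threshold for s in sims) / len(sims)

def assign(doc, clusters, threshold=0.3, max_drop=0.05, min_ratio=0.5):
    """Add doc to the cluster whose cohesiveness it preserves best.

    Hypothetical acceptance rule: the cluster accepts the document if its
    cohesiveness does not fall, or falls by at most max_drop while staying
    at or above min_ratio. If no cluster accepts, a new one is created.
    """
    best, best_ratio, best_sims = None, -1.0, []
    for c in clusters:
        new_sims = [cosine(doc, d) for d in c["docs"]]
        old = cohesiveness(c["sims"], threshold)
        new = cohesiveness(c["sims"] + new_sims, threshold)
        accepted = new >= old or (old - new <= max_drop and new >= min_ratio)
        if accepted and new > best_ratio:
            best, best_ratio, best_sims = c, new, new_sims
    if best is None:
        best = {"docs": [], "sims": []}
        clusters.append(best)
        best_sims = []
    best["docs"].append(doc)
    best["sims"].extend(best_sims)  # only the new pairs are added
    return best

# Usage: documents arrive one at a time, as in a dynamic collection.
docs = [
    {"web": 1.0, "cluster": 1.0},
    {"web": 1.0, "cluster": 1.0, "text": 1.0},
    {"football": 1.0, "goal": 1.0},
]
clusters = []
for d in docs:
    assign(d, clusters)
print([len(c["docs"]) for c in clusters])  # prints [2, 1]
```

Each cluster keeps only its list of pair-wise similarities, so adding a document needs just the similarities between that document and the cluster's current members. This matches the abstract's statement that the algorithm relies only on pair-wise document similarity information.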