Dynamic Document Clustering Using Singular Value Decomposition

Int. J. Comput. Model. Algorithms Medicine. Pub Date: 2012-07-01. DOI: 10.4018/jcmam.2012070103

Rashmi Nadubeediramesh, A. Gangopadhyay


Citation count: 6

Abstract

Incremental document clustering is important in many applications, and particularly in healthcare, where text data is abundant and ranges from published journal research to day-to-day records such as discharge summaries and nursing notes. In such dynamic environments, new documents are constantly added to the set used for the initial cluster formation, so the clusters must be updated incrementally and at low computational cost. In this paper the authors describe a novel, low-cost approach to incremental document clustering based on performing singular value decomposition (SVD) incrementally. They fold new documents into the existing term-document space and assign them to pre-defined clusters based on intra-cluster similarity, which avoids re-computing the SVD on the entire document set every time an update occurs. The authors also provide a way to retrieve documents based on different window sizes with high scalability and good clustering accuracy. They tested the method experimentally on 960 medical abstracts retrieved from the PubMed medical library, comparing it with the default approach in which the SVD is completely re-computed whenever new documents are added to the initial set. The results show minor decreases in cluster quality but much larger gains in computational throughput.
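The abstract does not give the authors' formulas, so the following is only a minimal sketch of the general idea, assuming the standard latent semantic indexing fold-in, in which a new document's term vector d is projected as Σ_k⁻¹ U_kᵀ d, and assuming cosine similarity to cluster centroids as a stand-in for the paper's "intra-cluster similarity". The function names, the toy term-document matrix, and the cluster labels are all hypothetical and are not taken from the paper.

```python
import numpy as np


def truncated_svd(A, k):
    """Rank-k SVD of a term-document matrix A (rows: terms, columns: documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # U_k (terms x k), top-k singular values, document coordinates (docs x k)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(U_k, s_k, d):
    """Project a new term vector d into the existing k-dimensional space.

    Standard LSI fold-in: d_hat = Sigma_k^{-1} U_k^T d. The SVD is not recomputed.
    """
    return (U_k.T @ d) / s_k


def assign_cluster(doc_coords, centroids):
    """Return the index of the centroid with the highest cosine similarity.

    Assumption: centroid cosine similarity stands in for the paper's
    'intra-cluster similarity', which the abstract does not define.
    """
    norms = np.linalg.norm(centroids, axis=1) * np.linalg.norm(doc_coords)
    sims = (centroids @ doc_coords) / (norms + 1e-12)
    return int(np.argmax(sims))


# Hypothetical toy data: 6 terms x 4 documents, two topics with disjoint vocabularies.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 1.0, 1.0],
])
labels = np.array([0, 0, 1, 1])  # assumed initial clustering of the 4 documents

U_k, s_k, V_k = truncated_svd(A, k=2)
centroids = np.vstack([V_k[labels == c].mean(axis=0) for c in (0, 1)])

# A new document that uses only the first topic's terms.
new_doc = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
new_cluster = assign_cluster(fold_in(U_k, s_k, new_doc), centroids)
```

Folding in costs one matrix-vector product per new document, whereas re-running the SVD costs a full decomposition of the enlarged matrix. The trade-off is that U_k and the singular values are frozen, so the space gradually stops reflecting the newer documents, which is consistent with the minor loss in cluster quality the abstract reports.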

