Efficient clustering of short text streams using online-offline clustering

Proceedings of the 21st ACM Symposium on Document Engineering. Pub Date: 2021-08-16. DOI: 10.1145/3469096.3469866

Md. Rashadul Hasan Rakib, N. Zeh, E. Milios


Citations: 4

Abstract

Short text stream clustering is an important but challenging task, since massive amounts of text are generated from sources such as micro-blogging, question-answering, and social news aggregation websites. The two major challenges of clustering such massive amounts of text are to cluster them within a reasonable amount of time and to achieve good clustering results. To overcome these two challenges, we propose an efficient short text stream clustering algorithm, called EStream, consisting of two modules: online and offline. The online module of EStream assigns texts to clusters one at a time as they arrive. To assign a text to a cluster, it computes the similarity between the text and a selected number of clusters instead of all clusters, which significantly reduces the running time of clustering short text streams. EStream assigns a text to a cluster (new or existing) using dynamically computed similarity thresholds, and thus deals efficiently with the concept drift problem. The offline module of EStream improves the distribution of texts across the clusters obtained by the online module so that upcoming short texts can be assigned to the appropriate clusters. Experimental results demonstrate that EStream outperforms the state-of-the-art short text stream clustering methods in clustering quality by a statistically significant margin on several short text datasets. Moreover, EStream runs several orders of magnitude faster than the state-of-the-art methods.
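
The online step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' EStream implementation: the abstract does not specify how candidate clusters are selected, how similarity is measured, or how the thresholds are computed. The sketch assumes an inverted index from terms to clusters for candidate selection, a term-overlap similarity, and a threshold equal to the mean plus the standard deviation of the candidate similarities, bounded below by a hypothetical `min_sim` parameter. The class name `OnlineClusterer` and all of its members are hypothetical as well.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


class OnlineClusterer:
    """Toy online short-text clusterer (illustrative only, not EStream)."""

    def __init__(self, min_sim=0.5):
        self.min_sim = min_sim            # assumed lower bound on the threshold
        self.clusters = {}                # cluster id -> Counter of term counts
        self.index = defaultdict(set)     # term -> ids of clusters containing it
        self.next_id = 0

    @staticmethod
    def _similarity(terms, cluster_terms):
        # Assumed measure: fraction of the text's terms already in the cluster.
        if not terms:
            return 0.0
        return sum(1 for t in terms if t in cluster_terms) / len(terms)

    def assign(self, text):
        """Assign one incoming text to a cluster and return the cluster id."""
        terms = set(text.lower().split())

        # Candidate selection: only clusters sharing at least one term,
        # so similarity is never computed against every cluster.
        candidates = set()
        for t in terms:
            candidates |= self.index.get(t, set())

        sims = {cid: self._similarity(terms, self.clusters[cid])
                for cid in candidates}

        target = None
        if sims:
            best = max(sims, key=sims.get)
            values = list(sims.values())
            # Assumed dynamic threshold, recomputed for every incoming text.
            threshold = max(self.min_sim, mean(values) + pstdev(values))
            if sims[best] >= threshold:
                target = best

        if target is None:
            # No sufficiently similar candidate: open a new cluster.
            target = self.next_id
            self.next_id += 1
            self.clusters[target] = Counter()

        # Update the cluster summary and the inverted index.
        self.clusters[target].update(terms)
        for t in terms:
            self.index[t].add(target)
        return target
```

Under these assumptions, calling `assign` on "stock market crash today" and then on "stock market crash news" places both texts in the same cluster, while "cute cat video" shares no term with any cluster and opens a new one. Because the threshold is recomputed from the current candidates for each text, it shifts as cluster contents change, which is the general idea behind handling concept drift without a fixed cut-off.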


