Efficient clustering of short text streams using online-offline clustering

Md. Rashadul Hasan Rakib, N. Zeh, E. Milios
Proceedings of the 21st ACM Symposium on Document Engineering, 2021-08-16
DOI: 10.1145/3469096.3469866 (https://doi.org/10.1145/3469096.3469866)
Citations: 4

Abstract

Short text stream clustering is an important but challenging task because a massive amount of text is generated by sources such as micro-blogging, question-answering, and social news aggregation websites. The two major challenges in clustering such a massive amount of text are to cluster it within a reasonable amount of time and to achieve a high-quality clustering result. To overcome these two challenges, we propose an efficient short text stream clustering algorithm (called EStream) consisting of two modules: online and offline. The online module of EStream assigns each text to a cluster as it arrives. To assign a text, it computes the similarity between the text and a selected number of clusters rather than all clusters, which significantly reduces the running time of clustering short text streams. EStream assigns a text to a cluster (new or existing) using dynamically computed similarity thresholds, and thus deals efficiently with the concept drift problem. The offline module of EStream refines the distribution of texts in the clusters obtained by the online module so that upcoming short texts can be assigned to the appropriate clusters. Experimental results demonstrate that EStream outperforms state-of-the-art short text stream clustering methods in clustering quality by a statistically significant margin on several short text datasets. Moreover, EStream runs several orders of magnitude faster than the state-of-the-art methods.
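
The online step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' EStream implementation. The abstract does not say how clusters are selected, which similarity measure is used, or how the thresholds are computed, so everything specific below is an assumption made for illustration: the class name `OnlineClusterer`, the inverted word index used to pick candidate clusters, Jaccard similarity over word sets, the parameters `max_candidates` and `floor`, and a threshold set to the mean minus one standard deviation of previously accepted similarities. The offline module is not sketched.

```python
from collections import Counter
from statistics import mean, pstdev


def jaccard(a, b):
    """Jaccard similarity of two non-empty word sets."""
    return len(a & b) / len(a | b)


class OnlineClusterer:
    """Toy online short-text clusterer (illustrative only, not EStream itself)."""

    def __init__(self, max_candidates=5, floor=0.2):
        self.max_candidates = max_candidates  # hypothetical cap on clusters compared
        self.floor = floor                    # hypothetical minimum threshold
        self.clusters = {}                    # cluster id -> set of words seen in it
        self.index = {}                       # word -> ids of clusters containing it
        self.history = []                     # similarities of accepted assignments
        self.next_id = 0

    def _candidates(self, words):
        # Select only clusters that share words with the text, ranked by the
        # number of shared words, instead of scanning every cluster.
        hits = Counter()
        for w in words:
            for cid in self.index.get(w, ()):
                hits[cid] += 1
        return [cid for cid, _ in hits.most_common(self.max_candidates)]

    def _threshold(self):
        # Assumed dynamic threshold: adapts to the similarities accepted so far.
        if not self.history:
            return self.floor
        return max(self.floor, mean(self.history) - pstdev(self.history))

    def add(self, text):
        """Assign one arriving text to an existing or new cluster; return its id."""
        words = set(text.lower().split())
        best_id, best_sim = None, 0.0
        for cid in self._candidates(words):
            sim = jaccard(words, self.clusters[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is not None and best_sim >= self._threshold():
            self.history.append(best_sim)
        else:
            # No sufficiently similar candidate: open a new cluster.
            best_id = self.next_id
            self.next_id += 1
            self.clusters[best_id] = set()
        self.clusters[best_id] |= words
        for w in words:
            self.index.setdefault(w, set()).add(best_id)
        return best_id
```

Calling `add` once per arriving text returns a cluster id. Because candidates come from the word index, a text is compared with at most `max_candidates` clusters regardless of how many clusters exist, which is the source of the running-time saving the abstract describes.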