A Novel Approach for Email Clustering Based on Semantics

2014 11th Web Information System and Application Conference. Pub Date: 2014-09-12. DOI: 10.1109/WISA.2014.56

Binlai He, Zefeng Li, Nan Yang

Citations: 5

Abstract

Clustering short documents has attracted increasing interest in recent years. Short documents do not contain enough text for similarities to be computed accurately with the most widely used technique, the Vector Space Model (VSM). Adding semantics to short-document clustering is one efficient way to solve this problem. However, real-life collections often contain both very short and very long documents. For example, the length of each email user's messages follows a power-law distribution, so long and short emails both appear in an email corpus. As a result, neither state-of-the-art short-document clustering approaches nor long-document clustering approaches achieve high cluster quality or high efficiency on collections that mix short and long documents. To solve this problem, we propose a novel approach for email clustering based on semantics. Empirical validation shows that our method obtains high cluster quality and high efficiency on real-world email datasets.
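The VSM limitation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method; it is a plain bag-of-words cosine similarity, and the example emails and the helper names `term_vector` and `cosine_similarity` are hypothetical. It shows that two short messages on the same topic can score zero similarity when they share no terms.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short emails: related in meaning, but with no words in common.
short_a = term_vector("meeting postponed until friday")
short_b = term_vector("appointment delayed to weekend")

# With no overlapping terms, the VSM similarity is exactly zero.
print(cosine_similarity(short_a, short_b))
```

A semantics-aware approach would need some additional signal, such as relating "meeting" to "appointment", to recognise that these two messages belong together. How the paper supplies that signal is not described in the abstract.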

