Adaptive spam filtering using dynamic feature space

17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'05). Pub Date: 2005-11-14. DOI: 10.1109/ICTAI.2005.28

Yan Zhou, M. Mulekar, Praveen Nerellapalli

Citations: 28

Abstract

Unsolicited bulk e-mail, also known as spam, is a growing problem for the e-mail community. This paper presents a new spam filtering strategy that 1) uses a practical entropy coding technique, Huffman coding, to dynamically encode the feature space of e-mail collections over time, and 2) applies an online algorithm to adaptively refine the learned spam concept as new e-mail data becomes available. The contributions of this work include a highly efficient spam filtering algorithm in which the input space is radically reduced to a single-dimension input vector, and an adaptive learning technique that is robust to vocabulary change, concept drift, and skewed data distributions. We compare our technique to several existing off-line learning techniques, including support vector machine, naive Bayes, k-nearest neighbor, C4.5 decision tree, RBFNetwork, boosted decision tree, and stacking, and demonstrate its effectiveness with experimental results on publicly available e-mail data.
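The abstract does not say how Huffman coding is applied to the feature space, so the following is only a minimal sketch of the general idea: build a Huffman code over token frequencies and rebuild it as new mail arrives. Every name here (`huffman_code_lengths`, `OnlineTokenCoder`, `message_bits`) is hypothetical, and summing code lengths into one number per message is an illustrative assumption, not the feature the paper defines.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {token: Huffman code length in bits} for a frequency table."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs one bit to be encoded.
        return {next(iter(freqs)): 1}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(f, next(tiebreak), {tok: 0}) for tok, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf moves one level down.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {tok: depth + 1 for tok, depth in d1.items()}
        merged.update({tok: depth + 1 for tok, depth in d2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


class OnlineTokenCoder:
    """Hypothetical helper: keeps running token counts, rebuilds the code on demand."""

    def __init__(self):
        self.counts = Counter()

    def update(self, tokens):
        # Fold the tokens of newly arrived mail into the running counts.
        self.counts.update(tokens)

    def code_lengths(self):
        return huffman_code_lengths(self.counts)


def message_bits(tokens, lengths):
    """Assumed scalar summary: total code length of the tokens known to the code."""
    return sum(lengths[t] for t in tokens if t in lengths)


coder = OnlineTokenCoder()
coder.update(["free"] * 5 + ["offer"] * 2 + ["meeting", "agenda"])
lengths = coder.code_lengths()
print(lengths)
print(message_bits(["free", "offer", "unseen"], lengths))
```

Because the code is rebuilt from running counts, a token that becomes frequent in later mail receives a shorter code word, which is one simple way a representation can follow vocabulary change over time.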
