Towards the Integration of Diverse Spam Filtering Techniques

2006 IEEE International Conference on Granular Computing. Pub Date: 2006-05-10. DOI: 10.1109/GRC.2006.1635746

C. Pu, Steve Webb, Oleg M. Kolesnikov, Wenke Lee, R. Lipton


Citations: 15

Abstract



Text-based spam filters (e.g., keyword and statistical learning filters) use tokens, which are found during message content analysis, to separate spam from legitimate messages. The effectiveness of these token-based filters is due to the presence of token signatures (i.e., tokens that are invariant across the many variants of a spam message). Unfortunately, it is relatively easy for spammers to hide or erase these signatures through simple techniques such as misspellings (to confuse keyword filters) and camouflage (i.e., combining spam and legitimate content to confuse statistical filters). Our hypothesis is that spam contains additional signatures that are more difficult to hide. A concrete example of this type of signature is the presence of URLs in spam messages, which are used to induce contact from their victims. We believe diverse spam filtering tools should be developed to incorporate these additional signatures. Thus, in this paper, we discuss a new type of URL-based filtering that can be integrated with existing spam filtering techniques to provide a more robust anti-spam solution. Our approach uses the syntactic constraints of URLs to find them in emails, and then uses semantic knowledge and tools (e.g., search engines) to refine and sharpen the spam identification process.

[...] email's routed path. In this paper, we focus our attention on spam messages that contain URLs and provide a novel approach for filtering these messages. The key observation is that most spam messages contain URLs that are "live", since the spammers would not be able to profit without a functioning link to their site. Thus, by checking the URLs found in a message and verifying a user's interest in the websites referenced by those URLs, we are able to add a new dimension to spam filtering.

This paper has two main contributions. First, we describe three techniques for filtering email messages that contain URLs: URL category whitelists, URL regular expression whitelists, and dynamic classification of websites. Second, we describe a prototype implementation that takes advantage of these three techniques to help enhance spam filtering. Our preliminary results suggest that new dimensions in spam filtering (e.g., using URLs) deserve further exploration. However, due to space limitations, we have omitted our experimental results from this paper.

The remainder of the paper is structured as follows. Section II gives an overview of the related work done in this research area. In Section III, we describe our approach, and Section IV discusses the details of our system's implementation. We provide our conclusions in Section V.
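
To make the URL regular expression whitelist idea more concrete, here is a minimal Python sketch. It is not the authors' prototype, whose details the abstract does not give. The URL pattern, the whitelist entries, the function names (`extract_urls`, `is_whitelisted`, `classify_message`), and the three labels are all assumptions chosen for illustration. URL category whitelists and dynamic website classification are left out, since they would need an external categorization source such as a search engine.

```python
import re
from urllib.parse import urlsplit

# Assumption: a deliberately simple pattern for http(s) URLs. Real mail needs
# more care (HTML entities, obfuscated links, other schemes).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)

# Hypothetical per-user whitelist: regular expressions over the host name.
WHITELIST_PATTERNS = [
    re.compile(r"(^|\.)example\.edu$"),
    re.compile(r"(^|\.)example\.org$"),
]


def extract_urls(message_text):
    """Find URLs by their syntax and strip trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?") for m in URL_PATTERN.finditer(message_text)]


def is_whitelisted(url):
    """Return True if the URL's host matches any whitelist pattern."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        # Malformed URL (for example a broken IPv6 literal): never whitelisted.
        return False
    return any(p.search(host) for p in WHITELIST_PATTERNS)


def classify_message(message_text):
    """Label a message using only its URLs: 'no-urls', 'pass', or 'suspect'."""
    urls = extract_urls(message_text)
    if not urls:
        return "no-urls"  # nothing to check; defer to token-based filters
    if all(is_whitelisted(u) for u in urls):
        return "pass"
    return "suspect"  # at least one URL falls outside the whitelist
```

In this sketch a "suspect" label is only a signal to combine with other filters, in the spirit of the integration the abstract argues for. It is not a final spam verdict.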
