Email Spam Filtering: A Systematic Review

IF 12.9; Zone 2 (Computer Science); JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval Pub Date : 2008-06-23 DOI:10.1561/1500000006

G. Cormack

Journal Article, pp. 335-455

Citations: 296

Abstract


Email Spam Filtering: A Systematic Review

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam?

We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.


Source journal

Foundations and Trends in Information Retrieval (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore

39.10

Self-citation rate

0.00%

Number of publications

About the journal: The surge in research across all domains in the past decade has produced a plethora of new publications and exponential growth in published research. Navigating this extensive literature and staying current has become a time-consuming challenge. While electronic publishing provides instant access to more articles than ever, discerning the essential ones for a comprehensive understanding of any topic remains an issue. Foundations and Trends® in Information Retrieval (FnTIR) addresses this problem by publishing high-quality survey and tutorial monographs in the field. Each issue of FnTIR features a 50-100 page monograph authored by research leaders, covering tutorial subjects, research retrospectives, and survey papers that provide state-of-the-art reviews within the scope of the journal.