Survey on web spam detection: principles and algorithms

SIGKDD explorations : newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining | Pub Date: 2012-05-01 | Pages: 50-64 | DOI: 10.1145/2207243.2207252

N. Spirin, Jiawei Han

Citations: 267

Abstract

Search engines have become the de facto starting point for information acquisition on the Web. However, because of web spam, search results are not always as good as desired. Moreover, spam evolves, which makes providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has attracted considerable interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques, with a focus on algorithms and underlying principles. We group existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. We further subdivide the link-based category into five groups based on the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied in web spam detection.
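
To make the label propagation group in the abstract more concrete, here is a minimal sketch of one well-known idea from that family: trust propagation in the style of TrustRank, where scores flow from a few hand-labeled good pages along outgoing links. The abstract itself gives no algorithmic detail, so the function name `propagate_trust`, the damping value, the iteration count, and the toy graph are all illustrative assumptions rather than the survey's own formulation.

```python
def propagate_trust(out_links, trusted_seeds, damping=0.85, iterations=50):
    """Spread trust from trusted seed pages along outgoing links.

    out_links: dict mapping a page to the list of pages it links to.
    trusted_seeds: set of pages labeled as good (hypothetical labels).
    Returns a dict mapping every page to its trust score.
    """
    # Collect every page that appears as a source or a target.
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Seed distribution: uniform over the trusted seeds, zero elsewhere.
    seed = {n: (1.0 / len(trusted_seeds) if n in trusted_seeds else 0.0)
            for n in nodes}
    trust = dict(seed)

    for _ in range(iterations):
        # Each round, a (1 - damping) share of trust is re-injected at the seeds.
        nxt = {n: (1.0 - damping) * seed[n] for n in nodes}
        for src in nodes:
            targets = out_links.get(src, [])
            if not targets:
                continue  # simplification: a dangling page passes nothing on
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust


# Toy web graph (assumed for illustration): spam pages link to good pages,
# but no good page links to spam.
toy_graph = {
    "good_a": ["good_b"],
    "good_b": ["good_a", "good_c"],
    "good_c": ["good_a"],
    "spam_x": ["spam_y", "good_a"],
    "spam_y": ["spam_x"],
}
scores = propagate_trust(toy_graph, trusted_seeds={"good_a"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.3f}")
```

In this toy graph the spam pages receive no trust because no trusted page links to them, which is the intuition such methods rely on; a real detector would still need a threshold or a downstream classifier to turn scores into labels.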
