Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web. International Workshop on Exploratory Search in Databases and the Web (5th : 2018 : Houston, Tex.)

Pub Date: 2004-06-17

DOI: 10.1145/1017074.1017077

Dennis Fetterly, M. Manasse, Marc Najork

Citations: 350

Abstract

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
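
The idea of treating statistical outliers in a page property as spam candidates can be illustrated with a small sketch. The abstract does not give the authors' procedure, so the code below substitutes a generic technique: a modified z-score based on the median absolute deviation. The function name `flag_outliers`, the threshold of 3.5, and the sample out-degree values are all hypothetical and chosen only for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Generic robust outlier test (median absolute deviation), used here as
    an assumption; it is not the method described in the paper.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # More than half the values are identical, so there is no robust
        # scale to compare against; this sketch flags nothing in that case.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normally distributed.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical out-degrees (links per page) for eight pages.
out_degrees = [10, 12, 11, 9, 13, 10, 11, 500]
suspects = flag_outliers(out_degrees)
print(suspects)
```

A flagged page is only a candidate for inspection: an extreme value in one property may have a legitimate cause, which is why the abstract speaks of outliers being highly correlated with spam rather than identical to it.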
