Spam detection with a content-based random-walk algorithm

SMUC '10 · Pub Date: 2010-10-30 · DOI: 10.1145/1871985.1871994
F. Javier Ortega, C. Macdonald, J. A. Troyano, Fermín L. Cruz
{"title":"基于内容的随机漫步算法的垃圾邮件检测","authors":"F. Javier Ortega, C. Macdonald, J. A. Troyano, Fermín L. Cruz","doi":"10.1145/1871985.1871994","DOIUrl":null,"url":null,"abstract":"In this work we tackle the problem of the spam detection on the Web. Spam web pages have become a problem for Web search engines, due to the negative effects that this phenomenon can cause in their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a-priori estimation of the spam likekihood of the web pages. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) is a web page, according to its textual content and the relations in the graph. Our experiments show that our proposed technique outperforms other link-based techniques for spam detection.","PeriodicalId":244822,"journal":{"name":"SMUC '10","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Spam detection with a content-based random-walk algorithm\",\"authors\":\"F. Javier Ortega, C. Macdonald, J. A. Troyano, Fermín L. Cruz\",\"doi\":\"10.1145/1871985.1871994\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work we tackle the problem of the spam detection on the Web. Spam web pages have become a problem for Web search engines, due to the negative effects that this phenomenon can cause in their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a-priori estimation of the spam likekihood of the web pages. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) is a web page, according to its textual content and the relations in the graph. Our experiments show that our proposed technique outperforms other link-based techniques for spam detection.\",\"PeriodicalId\":244822,\"journal\":{\"name\":\"SMUC '10\",\"volume\":\"85 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-10-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SMUC '10\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1871985.1871994\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SMUC '10","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1871985.1871994","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 9

Abstract

In this work we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines, due to the negative effects that this phenomenon can cause in their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a priori estimation of the spam likelihood of the web pages. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how good or bad (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that our proposed technique outperforms other link-based techniques for spam detection.
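
The abstract describes the method only at a high level, so the sketch below is a minimal, hypothetical illustration of the general idea: a personalized-PageRank-style random walk that propagates two scores per page ("good" and "spam"), each seeded by a content-based a priori estimate and spread along the link graph. The function name, the damping factor, the prior vectors, and the way the two scores are combined into a ranking are all assumptions for illustration, not the authors' actual formulation.

```python
# Hypothetical sketch of a two-score random-walk spam ranking.
# Assumes a personalized-PageRank-style propagation seeded by content-based
# priors; the paper's exact propagation and prior estimation may differ.
import numpy as np

def two_score_random_walk(adj, good_prior, spam_prior, damping=0.85, iters=50):
    """Propagate 'good' and 'spam' scores over a web graph.

    adj        : (n, n) adjacency matrix, adj[i, j] = 1 if page i links to j.
    good_prior : length-n non-negative vector from a content-based estimator
                 (higher = more likely legitimate); normalized internally.
    spam_prior : length-n non-negative vector (higher = more spam-like).
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling pages jump uniformly.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)

    good = good_prior / good_prior.sum()
    spam = spam_prior / spam_prior.sum()
    g, s = good.copy(), spam.copy()
    for _ in range(iters):
        # Scores flow along links; the content prior acts as the teleport vector.
        g = damping * (trans.T @ g) + (1 - damping) * good
        s = damping * (trans.T @ s) + (1 - damping) * spam
    return g, s

if __name__ == "__main__":
    # Toy 4-page graph with hypothetical content-based prior estimates.
    adj = np.array([[0, 1, 1, 0],
                    [0, 0, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    good_prior = np.array([0.9, 0.8, 0.7, 0.1])
    spam_prior = np.array([0.1, 0.2, 0.3, 0.9])
    g, s = two_score_random_walk(adj, good_prior, spam_prior)
    ranking = np.argsort(-(g - s))
    print("pages ranked (least spam-like first):", ranking)
```

In this sketch the final ranking simply combines the two scores (good minus spam) so that likely spam pages are demoted; the combination rule used in the paper may differ.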