The Classification Power of Web Features

Q3 Mathematics

Internet Mathematics Pub Date : 2014-04-18 DOI:10.1080/15427951.2013.850456

M. Erdélyi, A. Benczúr, B. Daróczy, A. Garzó, Tamás Kiss, Dávid Siklósi

Internet Mathematics, vol. 10, pp. 421-457. Journal Article.

Citations: 5

Abstract

In this article we give a comprehensive overview of features devised for web spam detection and investigate how much each feature class, some of which require very high computational effort, adds to classification accuracy. We collect and handle a large number of features based on recent advances in web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change. We propose new temporal features based on link similarity and show how to compute them efficiently on large graphs. We show that machine learning techniques, including ensemble selection, LogitBoost, and random forest, significantly improve accuracy. We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all results previously published on our dataset and can be improved only slightly by computationally expensive features. We test our method on three major publicly available datasets: the Web Spam Challenge 2008 dataset WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge dataset DC2010, and the Waterloo Spam Rankings for ClueWeb09. Our classifier ensemble sets the strongest classification benchmark compared to participants of the Web Spam and ECML/PKDD Discovery Challenges as well as the TREC Web track. To foster research in the area, we make several feature sets and the source code public at https://datamining.sztaki.hu/en/download/web-spam-resources, including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007 as well as the Web Spam Challenge features for the labeled part of ClueWeb09.
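As a rough illustration of what a linkage-change feature can look like, the sketch below scores each host by how much its set of out-links differs between two crawl snapshots, using one minus the Jaccard similarity. This is only an assumed, simplified stand-in: the abstract does not define the paper's temporal link-similarity features, and the function names, the graph representation, and the toy hosts are all hypothetical.

```python
from typing import Dict, Set

# Assumed representation: host name -> set of hosts it links to.
Graph = Dict[str, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def linkage_change(old: Graph, new: Graph) -> Dict[str, float]:
    """Per-host change score in [0, 1]: 1 - Jaccard of out-links across snapshots."""
    hosts = set(old) | set(new)
    return {
        h: 1.0 - jaccard(old.get(h, set()), new.get(h, set()))
        for h in hosts
    }


# Toy snapshots with made-up hosts (not taken from any real crawl).
snapshot_t0: Graph = {"a.uk": {"b.uk", "c.uk"}, "b.uk": {"c.uk"}}
snapshot_t1: Graph = {"a.uk": {"b.uk", "d.uk"}, "b.uk": {"c.uk"}}

change = linkage_change(snapshot_t0, snapshot_t1)
```

A score near 0 means the host kept its out-links between snapshots, and a score near 1 means they were almost entirely replaced. Scores like these could be fed to a classifier alongside content and link features.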

