Co-training based semi-supervised Web spam detection

2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) Pub Date: 2013-07-23 DOI: 10.1109/FSKD.2013.6816301

Wei Wang, Xiaodong Lee, An-Lei Hu, Guanggang Geng


Citations: 7

Abstract

Traditional Web spam classifiers are trained only on labeled data (feature/label pairs). Labeled spam instances, however, are difficult, expensive, or time-consuming to obtain because they require the effort of experienced human annotators, whereas unlabeled samples are relatively easy to collect. Semi-supervised learning addresses this problem by using large amounts of unlabeled data together with the labeled data to build better classifiers. This paper proposes two new semi-supervised learning algorithms to boost the performance of Web spam classifiers. The algorithms integrate traditional co-training with topological-dependency-based hyperlink learning. The proposed methods extend our previous work on self-training-based semi-supervised Web spam detection. Experimental results with 100 and 200 labeled samples on the standard WEBSPAM-UK2006 benchmark show that the algorithms are effective.
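As a rough illustration of the generic co-training idea the abstract refers to, the sketch below trains one toy classifier per feature view and lets each add its most confident pseudo-labeled sample to a shared labeled set. It is not the authors' algorithm: the topological-dependency-based hyperlink learning is not specified on this page and is not reproduced. The two views (a content score and a link score), the nearest-centroid classifier, the toy data, and all function names are assumptions made for illustration only.

```python
from statistics import mean

def fit(values, labels):
    """Toy one-feature classifier: store the mean feature value per class."""
    return {c: mean(v for v, y in zip(values, labels) if y == c) for c in set(labels)}

def classify(model, value):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    ranked = sorted((abs(value - m), c) for c, m in model.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b)."""
    labeled, pool = list(labeled), list(unlabeled)  # copy so inputs are not mutated
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            # Train on this view only, using the shared labeled set.
            model = fit([x[view] for x, _ in labeled], [y for _, y in labeled])
            # Move the most confidently classified samples from the pool to the labeled set.
            best = sorted(pool, key=lambda x: classify(model, x[view])[1], reverse=True)
            for x in best[:per_round]:
                labeled.append((x, classify(model, x[view])[0]))
                pool.remove(x)
    models = [fit([x[v] for x, _ in labeled], [y for _, y in labeled]) for v in (0, 1)]
    return models, labeled

def predict(models, x):
    """Label x with whichever view's classifier is more confident."""
    return max((classify(m, x[v]) for v, m in enumerate(models)), key=lambda r: r[1])[0]

# Hypothetical toy data: (content_score, link_score) per host.
labeled = [((0.9, 0.8), "spam"), ((0.1, 0.2), "normal")]
unlabeled = [(0.8, 0.9), (0.2, 0.1), (0.7, 0.6), (0.3, 0.35), (0.85, 0.7), (0.15, 0.3)]

models, grown = co_train(labeled, unlabeled, rounds=3)
print(len(grown))                     # 8: two seed labels plus six pseudo-labels
print(predict(models, (0.75, 0.65)))  # spam
```

The design choice worth noting is that both classifiers write into one shared labeled set, so a sample that is easy for the content view can later help the link view, and vice versa. With real data the confidence threshold and the number of samples added per round would need tuning, since early pseudo-labeling mistakes propagate.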
