Co-training based semi-supervised Web spam detection

Wei Wang, Xiaodong Lee, An-Lei Hu, Guanggang Geng
Published in: 2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2013-07-23
DOI: 10.1109/FSKD.2013.6816301 (https://doi.org/10.1109/FSKD.2013.6816301)
Citations: 7

Abstract

Traditional Web spam classifiers are trained on labeled data only (feature/label pairs). Labeled spam instances, however, are difficult, expensive, or time-consuming to obtain, because they require the effort of experienced human annotators. Unlabeled samples, meanwhile, are relatively easy to collect. Semi-supervised learning addresses the classification problem by using large amounts of unlabeled data together with the labeled data to build better classifiers. This paper proposes two new semi-supervised learning algorithms to boost the performance of Web spam classifiers. The algorithms integrate traditional co-training with hyperlink learning based on topological dependency. The proposed methods extend our previous work on self-training based semi-supervised Web spam detection. Experimental results with 100 and 200 labeled samples on the standard WEBSPAM-UK2006 benchmark show that the algorithms are effective.
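
For readers unfamiliar with co-training, the following is a minimal sketch of the generic idea only: two classifiers, each trained on a different feature view of the same hosts, take turns pseudo-labeling the unlabeled hosts they are most confident about. It is not the paper's algorithm and does not include the topological-dependency hyperlink learning step. The toy nearest-centroid classifier, the synthetic two-view data, and all names (`CentroidClassifier`, `co_train`, `view_a`, `view_b`, `rounds`, `per_round`) are assumptions made for illustration.

```python
# Illustrative co-training sketch (assumed setup, not the authors' method).

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class CentroidClassifier:
    """Toy nearest-centroid classifier for labels 0 (normal) and 1 (spam)."""

    def fit(self, X, y):
        # Assumes both classes are present in the training labels.
        self.c = {lab: centroid([x for x, l in zip(X, y) if l == lab])
                  for lab in (0, 1)}
        return self

    def predict_with_margin(self, x):
        # The margin (difference of distances) serves as a crude confidence.
        d0, d1 = sq_dist(x, self.c[0]), sq_dist(x, self.c[1])
        return (1 if d1 < d0 else 0), abs(d0 - d1)

def co_train(view_a, view_b, seed_labels, rounds=5, per_round=2):
    """Grow the labeled set by alternating between the two views."""
    labeled = dict(seed_labels)
    unlabeled = [i for i in range(len(view_a)) if i not in labeled]
    for _ in range(rounds):
        if not unlabeled:
            break
        for view in (view_a, view_b):
            idx = sorted(labeled)
            clf = CentroidClassifier().fit([view[i] for i in idx],
                                           [labeled[i] for i in idx])
            # Rank the remaining unlabeled hosts by confidence in this view.
            scored = sorted(((clf.predict_with_margin(view[i]), i)
                             for i in unlabeled),
                            key=lambda t: -t[0][1])
            # The most confident predictions become pseudo-labels that the
            # other view's classifier will train on in its next turn.
            for (lab, _margin), i in scored[:per_round]:
                labeled[i] = lab
                unlabeled.remove(i)
    return labeled

# Synthetic data: 6 normal hosts followed by 6 spam hosts, in two views
# (imagined here as content-based and link-based features).
view_a = ([[0.1 * k, 0.2 * k] for k in range(6)]
          + [[5 + 0.1 * k, 5 - 0.2 * k] for k in range(6)])
view_b = [[1 - 0.1 * k] for k in range(6)] + [[4 + 0.1 * k] for k in range(6)]
truth = [0] * 6 + [1] * 6

# Only one labeled example per class is given as the seed set.
result = co_train(view_a, view_b, {0: 0, 6: 1})
print(sum(result[i] == truth[i] for i in range(12)), "of 12 hosts match")
```

The synthetic classes are deliberately well separated, so the sketch only demonstrates the control flow. On real data, mislabeled pseudo-labels can accumulate, which is one reason confidence thresholds and additional signals are typically combined with plain co-training.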