An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach

A. Shahzad, Nazri M. Nawi, M. Z. Rehman, Abdullah Khan
{"title":"An Improved Framework for Content- and Link-Based Web-Spam Detection: A Combined Approach","authors":"A. Shahzad, Nazri M. Nawi, M. Z. Rehman, Abdullah Khan","doi":"10.1155/2021/6625739","DOIUrl":null,"url":null,"abstract":"In this modern era, people utilise the web to share information and to deliver services and products. The information seekers use different search engines (SEs) such as Google, Bing, and Yahoo as tools to search for products, services, and information. However, web spamming is one of the most significant issues encountered by SEs because it dramatically affects the quality of SE results. Web spamming’s economic impact is enormous because web spammers index massive free advertising data on SEs to increase the volume of web traffic on a targeted website. Spammers trick an SE into ranking irrelevant web pages higher than relevant web pages in the search engine results pages (SERPs) using different web-spamming techniques. Consequently, these high-ranked unrelated web pages contain insufficient or inappropriate information for the user. To detect the spam web pages, several researchers from industry and academia are working. No efficient technique that is capable of catching all spam web pages on the World Wide Web (WWW) has been presented yet. This research is an attempt to propose an improved framework for content- and link-based web-spam identification. The framework uses stopwords, keywords’ frequency, part of speech (POS) ratio, spam keywords database, and copied-content algorithms for content-based web-spam detection. For link-based web-spam detection, we initially exposed the relationship network behind the link-based web spamming and then used the paid-link database, neighbour pages, spam signals, and link-farm algorithms. Finally, we combined all the content- and link-based spam identification algorithms to identify both types of spam. To conduct experiments and to obtain threshold values, WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used. A promising F-measure of 79.6% with 81.2% precision shows the applicability and effectiveness of the proposed approach.","PeriodicalId":72654,"journal":{"name":"Complex psychiatry","volume":"48 1","pages":"6625739:1-6625739:18"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex psychiatry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1155/2021/6625739","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

People use the web to share information and to deliver services and products. Information seekers rely on search engines (SEs) such as Google, Bing, and Yahoo to find products, services, and information. However, web spamming is one of the most significant issues encountered by SEs because it dramatically affects the quality of SE results. Its economic impact is enormous because spammers push large volumes of free advertising content into SE indexes to increase traffic to targeted websites. Using various web-spamming techniques, spammers trick an SE into ranking irrelevant web pages above relevant ones in the search engine results pages (SERPs). Consequently, these highly ranked but unrelated pages give users insufficient or inappropriate information. Researchers in industry and academia have worked on detecting spam web pages, but no technique capable of catching all spam pages on the World Wide Web (WWW) has been presented yet. This research proposes an improved framework for content- and link-based web-spam identification. For content-based detection, the framework uses stopwords, keyword frequency, part-of-speech (POS) ratio, a spam-keyword database, and copied-content algorithms. For link-based detection, we first expose the relationship network behind link-based web spamming and then apply paid-link-database, neighbour-page, spam-signal, and link-farm algorithms. Finally, we combine all of the content- and link-based algorithms to identify both types of spam. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct experiments and to obtain threshold values. A promising F-measure of 79.6% with 81.2% precision shows the applicability and effectiveness of the proposed approach.
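The abstract lists several content-side signals (stopword usage, keyword frequency, POS ratio, a spam-keyword list) without implementation detail. The sketch below is a minimal, illustrative Python rendering of how such content features could be computed and combined with simple thresholds. The stopword list, feature names, and threshold values are assumptions for illustration only, not the paper's parameters (the paper derives its thresholds from WEBSPAM-UK2006/UK2007); the POS-ratio feature is omitted because it would require a POS tagger.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the paper's stopword resource is not reproduced here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}

def content_features(page_text: str, spam_keywords: set) -> dict:
    """Compute simple content-based spam signals for a single page's text."""
    tokens = re.findall(r"[a-z']+", page_text.lower())
    if not tokens:
        return {"stopword_ratio": 0.0, "top_keyword_freq": 0.0, "spam_keyword_hits": 0}

    counts = Counter(tokens)
    total = len(tokens)
    # Keyword-stuffed spam pages often contain unusually few stopwords.
    stopword_ratio = sum(counts[w] for w in STOPWORDS) / total
    # Share of the page taken by its single most repeated word (a keyword-stuffing signal).
    top_keyword_freq = counts.most_common(1)[0][1] / total
    # Hits against a spam-keyword list supplied by the caller (hypothetical list here).
    spam_keyword_hits = sum(counts[w] for w in spam_keywords if w in counts)
    return {
        "stopword_ratio": stopword_ratio,
        "top_keyword_freq": top_keyword_freq,
        "spam_keyword_hits": spam_keyword_hits,
    }

def looks_spammy(features: dict) -> bool:
    """Toy decision rule with made-up thresholds; the paper tunes its own
    thresholds on the WEBSPAM-UK2006/UK2007 datasets."""
    return (features["stopword_ratio"] < 0.05
            or features["top_keyword_freq"] > 0.20
            or features["spam_keyword_hits"] > 10)

# Example usage with a hypothetical keyword-stuffed page:
page = "cheap pills cheap pills buy cheap pills online best cheap pills"
print(looks_spammy(content_features(page, {"pills", "casino"})))  # -> True
```

For reference on the reported numbers: the F-measure is the harmonic mean of precision and recall, F = 2PR/(P + R), so an F-measure of 79.6% at 81.2% precision corresponds to a recall of roughly 78%.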